Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests. Tsz-Ho Yu, Danhang Tang, T-K Kim. Sponsored by Samsung.


TRANSCRIPT

Slide 1

Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests

Tsz-Ho Yu, Danhang Tang

T-K Kim. Sponsored by Samsung.

Good afternoon everyone. My name is Danhang Tang, from Imperial College London. Today I'm going to introduce our work, "Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests". This is joint work with Tsz-Ho Yu from Cambridge University and my supervisor Dr. T-K Kim, and it is sponsored by Samsung.

Before I start, let me show you a short clip of our results so you can get an intuitive idea of our work.

Motivation

Multiple cameras with inverse kinematics: [Bissacco et al. CVPR2007] [Yao et al. IJCV2012] [Sigal IJCV2011]

Learning-based (regression): [Navaratnam et al. BMVC2006] [Andriluka et al. CVPR2010]

Specialized hardware (e.g. structured light sensor, TOF camera): [Shotton et al. CVPR11] [Baak et al. ICCV2011] [Ye et al. CVPR2011] [Sun et al. CVPR2012]

The motivation behind this work is that there are many successful existing human pose estimation methods.

Motivation

Discriminative approaches (RF) have achieved great success in human body pose estimation:
- Efficient: real-time
- Accurate: frame-basis, not relying on tracking
- Require a large dataset to cover many poses
- Train on synthetic, test on real data
- Don't exploit kinematic constraints

Examples: Shotton et al. CVPR11, Girshick et al. ICCV11, Sun et al. CVPR12
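These depth-based forests typically split pixels on simple depth-comparison features, as popularised by Shotton et al. CVPR11. The sketch below is illustrative rather than the paper's exact implementation; the function name and the background-depth constant are assumptions.

```python
import numpy as np

def depth_feature(depth, x, u, v, background=10000.0):
    """Depth-comparison feature f(x) = d(x + u/d(x)) - d(x + v/d(x)).

    Offsets u and v are scaled by 1/d(x) so the feature is invariant to
    the hand's distance from the camera; probes that fall outside the
    image are treated as background.
    """
    h, w = depth.shape
    dx = depth[x]

    def probe(offset):
        # Normalise the pixel offset by the depth at the reference pixel.
        p = (int(x[0] + offset[0] / dx), int(x[1] + offset[1] / dx))
        if 0 <= p[0] < h and 0 <= p[1] < w:
            return depth[p]
        return background  # out-of-bounds probes count as background

    return probe(u) - probe(v)
```

A split node then thresholds this value to send a pixel to the left or right child; the feature is cheap enough to evaluate millions of candidates during training.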

Among them, using Random Forests on depth data has recently become very successful. It is very efficient and can run in real time. It is also very accurate; even single-frame estimation gives satisfying results. However, these are learning-based methods, which require a large dataset to cover as many poses as possible. People often train on synthetic data and test on real data. Also, they normally don't exploit kinematic constraints.

Challenges for Hand?

Labeling is difficult and tedious!

Viewpoint changes and self occlusions

Discrepancy between synthetic and real data is larger than for the human body

Our method:
- Hierarchical Hybrid Forest
- Transductive Learning
- Semi-supervised Learning

Existing Approaches

Generative approaches: model-fitting, no training is required

Oikonomidis et al. ICCV2011

De La Gorce et al. PAMI2010

Hamer et al. ICCV2009

Motion capture: Ballan et al. ECCV 2012

Drawbacks: slow; needs initialisation and tracking

Discriminative approaches: similar solutions to human body pose estimation; performance on real data remains challenging

Wang et al. SIGGRAPH2009

Stenger et al. IVC 2007

Keskin et al. ECCV2012

Xu and Cheng ICCV 2013

Moreover, I'd like to point out that in this year's ICCV, Xu and Cheng also proposed a solution to address these issues and achieved good results on real data. Unlike our method, they tried to model sensor noise explicitly.

Hierarchical Hybrid Forest

STR forest:

- Qa: viewpoint classification quality (information gain)
- Qp: joint label classification quality (information gain)
- Qv: compactness of voting vectors (determinant of the covariance trace)
- (α, β): margin measures of viewpoint labels and joint labels

Viewpoint Classification (Qa), then Finger Joint Classification (Qp), then Pose Regression (Qv)

Qapv = αQa + (1-α)βQp + (1-α)(1-β)Qv

The idea is to break the problem of pose estimation down into a three-level coarse-to-fine search. At each level we use a term to measure the quality of split functions. At the first level, we label training data with their viewpoint, which is the normal to the palm; the measuring term Qa is the information gain of a split.

At the second level, pixels are classified into one of the 16 finger joints, so the measuring term Qp is also an information gain on this classification.

At the third level, in order to vote for occluded joints, we use regression and Hough voting; Qv is designed to minimise the variance of the voting vectors to all joints.

Using all three terms together is slow, so we design an adaptive switching scheme that controls the coefficients (α, β) in a binary form. For example, at the first level we keep monitoring the purity of viewpoints at each node; once it is pure enough, we switch to the next term, finger joint classification.
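The combined quality function and the binary switching scheme can be sketched as follows. This is a minimal illustration of the idea; the helper names and the purity threshold are assumptions, not values from the paper.

```python
from collections import Counter

def q_apv(alpha, beta, qa, qp, qv):
    """Combined split quality: Qapv = a*Qa + (1-a)*b*Qp + (1-a)*(1-b)*Qv."""
    return alpha * qa + (1 - alpha) * beta * qp + (1 - alpha) * (1 - beta) * qv

def purity(labels):
    """Fraction of samples carrying the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def switch_coefficients(view_labels, joint_labels, thresh=0.95):
    """Binary switching: optimise Qa until the viewpoints at a node are
    pure, then Qp until the joint labels are pure, and only then the
    regression term Qv."""
    alpha = 1.0 if purity(view_labels) < thresh else 0.0
    beta = 1.0 if purity(joint_labels) < thresh else 0.0
    return alpha, beta
```

With (α, β) restricted to {0, 1}, only one of the three terms is ever evaluated at a given node, which is what keeps training tractable.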

Our method: Transductive Learning and Semi-supervised Learning

Transductive Learning

Training data D = {Rl, Ru, S}: labeled (Rl), unlabeled (Ru), synthetic (S)

Target space (realistic data R): captured from a Primesense depth sensor; a small part of R, Rl, is labeled manually (the rest is the unlabeled set Ru)

Source space (synthetic data S): generated from an articulated hand model; all labeled

The idea of transductive learning is to transfer knowledge from one domain to another. In our case, the first domain is the synthetic training data, which is generated with an articulated CAD model, so we have labels for all pixels. The second domain is the realistic data, captured with a Primesense depth sensor. Only a small part of the real data is manually labeled.

Training data D = {Rl, Ru, S}, where |S| >> |R|


As mentioned before, the hierarchical hybrid forest divides the feature space in a coarse-to-fine manner.

Transductive Learning

Similar data points in Rl and S are paired; if a split function separates a pair, a penalty is given.


To transfer the knowledge, during training, for each labeled data point in the realistic domain we use nearest-neighbour search to find the closest data point in the synthetic domain and form a pairing between them. If a split function separates this pair into two child nodes, a penalty is given to it. With this transductive pairing term, the training process tends to select the split functions that best satisfy both domains.

Semi-supervised Learning

We also introduce a semi-supervised term to make use of unlabeled real data when evaluating split functions.
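The pairing and penalty steps could be sketched as below. This is a simplified illustration under the assumption that samples are feature vectors and a split is summarised by which child each sample goes to; the function names are hypothetical.

```python
import numpy as np

def build_pairs(real_labeled, synthetic):
    """Pair each labeled real sample with its nearest synthetic neighbour."""
    pairs = []
    for i, r in enumerate(real_labeled):
        dists = np.linalg.norm(synthetic - r, axis=1)
        pairs.append((i, int(np.argmin(dists))))
    return pairs

def pairing_penalty(pairs, left_real, left_syn):
    """Fraction of pairs a candidate split separates into different children.

    left_real[i] / left_syn[j] are booleans: does the split send real
    sample i / synthetic sample j to the left child?
    """
    broken = sum(1 for i, j in pairs if left_real[i] != left_syn[j])
    return broken / max(len(pairs), 1)
```

A split that scores well on the classification terms but breaks many real-synthetic pairs would be penalised, nudging the forest toward splits that generalise across both domains.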

Source space(Synthetic data S)Target space(Realistic data R)

To make use of unlabelled real data, we adopt a semi-supervised term that minimises the appearance variance of the patches in each node. With this semi-supervised term and the pairing term working together, we can transfer knowledge from synthetic to real and from labeled to unlabeled, and train a classifier that works well on both domains in one go.
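A minimal sketch of such a variance term, assuming patches are represented as fixed-length descriptor vectors (the function names are illustrative):

```python
import numpy as np

def appearance_variance(patches):
    """Mean per-dimension variance of the patch descriptors in a node."""
    patches = np.asarray(patches, dtype=float)
    if len(patches) < 2:
        return 0.0
    return float(np.mean(np.var(patches, axis=0)))

def semi_supervised_quality(left, right):
    """Higher when both children contain visually coherent patches.

    Unlabeled real patches contribute here because no labels are
    needed to compute appearance variance.
    """
    n = len(left) + len(right)
    return -(len(left) / n * appearance_variance(left)
             + len(right) / n * appearance_variance(right))
```

Because this score needs only the raw patch appearance, every real sample, labeled or not, influences which split function is chosen.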

Kinematic Refinement

Due to memory restrictions, trees cannot be grown too deep, so the voting results can sometimes be ambiguous. To select correct positions from the finger joint proposals produced by the STR forest, we adopt a data-driven kinematic refinement step afterwards. The idea is: for each joint, we fit a 2-part GMM to the votes and measure the Euclidean distance between the two modes. If the distance is smaller than a threshold, we say the joint is certain; otherwise it is uncertain. This separates the finger joint proposals into a certain set and an uncertain set. We then use the certain joints to query a large joint-position database in a nearest-neighbour manner, and choose the uncertain joint positions that are close to the result of the query.
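The certainty test could be sketched as follows. The paper fits a 2-part GMM; as a simplified proxy, this sketch locates the two modes with a tiny two-centre k-means, and the distance threshold is an assumed value.

```python
import numpy as np

def two_mode_distance(votes, iters=20):
    """Distance between the two modes of a set of vote vectors.

    Stand-in for the paper's 2-part GMM: a small k-means (k=2),
    initialised at the extreme points along the first axis.
    """
    votes = np.asarray(votes, dtype=float)
    centres = np.stack([votes[np.argmin(votes[:, 0])],
                        votes[np.argmax(votes[:, 0])]])
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(votes[:, None] - centres[None], axis=2), axis=1)
        for k in range(2):
            if np.any(assign == k):
                centres[k] = votes[assign == k].mean(axis=0)
    return float(np.linalg.norm(centres[0] - centres[1]))

def split_by_certainty(joint_votes, thresh=1.0):
    """Joints whose two vote modes nearly coincide are 'certain'; the rest
    are 'uncertain' and are later resolved by the nearest-neighbour query
    against the joint-position database."""
    certain, uncertain = [], []
    for joint, votes in joint_votes.items():
        (certain if two_mode_distance(votes) < thresh else uncertain).append(joint)
    return certain, uncertain
```

When the votes form a single tight cluster the two fitted modes nearly coincide; a clearly bimodal vote distribution yields a large inter-mode distance and flags the joint as uncertain.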


Experiment Settings

Evaluation data: three different testing sequences
- Sequence A: single viewpoint (450 frames)
- Sequence B: multiple viewpoints, with slow hand movements (1000 frames)
- Sequence C: multiple viewpoints, with fast hand movements (240 frames)

Training data:
- Synthetic data (337.5K images)
- Real data (81K images,