monocular reconstruction of vehicles: combining...

Monocular Reconstruction of Vehicles: Combining SLAM with Shape PriorsSupplementary Material

Falak Chhaya1, Dinesh Reddy1, Sarthak Upadhyay1, Visesh Chari1, M. Zeeshan Zia2 and K. Madhava Krishna1

I. INTRODUCTION

Primary purpose of this document is for better under-standing of algorithm-1 which is Multi-view Stochastic Hill-Climbing, and that of all the terms in the main objectivefunction of the main paper. We shall start with all the termsin the main objective function and then explain Multi-viewStochastic Hill-Climbing.

II. MAIN OBJECTIVE FUNCTION

The main objective function in the main paper, equation7, contains four terms namely boxfit, detectionfit,deepfit, photofit. These terms exploit different infor-mation and come into action at different times to tacklevarious problems that we faced.

A. boxfit

Purpose of boxfit is to estimate pose of camera withrespect to object(car) for first 3-5 frames. Using this pose,we optimize for shape of the car using Multi-view StochasticHill Climbing. Ideally we should be able to do this for all theimages.But empirically, we found out, on multiple datasets,that tracks even with state-of-the-art Deep-Match/Deep-flow[2] become noisy and don’t yield great results for poseestimation after 6-7 frames in general. Thus, after optimizingfor shape of the car, we just need to optimize for posein subsequent frames. Some more explanation about howboxfit works is given below.

The features within 2D bounding box are divided intotwo sets as shown here in Figure 1 such that inliers aremaximum when we compute Homography matrices to fiteach of these two sets of features. The fact, that these twoplanes on a car are practically perpendicular, is used toeliminate incorrect combinations of Rotation and Translationmatrices that we get from decomposition of the previouslycalculated Homography matrices. Top view of reconstructionis shown in Figure 2.

B. deepfit

deepfit uses 2D projection of deformable wireframeand and deep matches contained within the closed 2D spacecontained within projection of deformable wireframe. Theprinciple idea here is that shear between above mentionedtwo entities in consecutive frames should be as minimal aspossible. An example with one feature point and one vertex

1 International Institute of Information Technology, Hyderabad, India2 Imperial College, London, UK

Fig. 1: Two planes are segmented on the car as mentioned inSection IV(B)(1) of the main paper. The fact, that these planesare practically perpendicular, is exploited in getting correct posethrough our newly proposed SfM.

Fig. 2: boxfit reconstruction as mentioned in Section IV(B)(2) ofthe main paper. It shows top view where left and rear sides ofthe car are being reconstructed(green). Camera poses at differentinstances are shown in blue

of deformable wireframe is presented below for understand-ing purpose.

Figure 3(a) and (b) are the same images at the instantt. Two points are marked on it, point on the head of thearrow is a “part“ of deformable wireframe, corresponding tothe term P(Sj(α), P (t)) in equation (5) in the main paperand point on the tail of the arrow (near the number plate)is a feature detected within the area covered by deformablewireframe(This feature is matched at next instant in (c) and(d) as well), which corresponds to the term xit. The Yellowarrow depicts the distance between these two points. This

arrow is shown in all four images for comparison.Figure 3(c) and (d) show different particles at instant

t+ 1 but they show particles with different poses. Differentparticles mean they have different pose paramters. The termxit+1 in equation (5) in the main paper is the tracked/matchedpoint of xit whereas the term P(Sj(α), P (t+1)) correspondsto the head of the white arrow shown in (c). The white andyellow arrows almost coincide in 9(d). The deepfit termD(Sj(α), P (t), P (t+ 1), χi

)is the difference between the

magnitudes of these two arrows. Considering just one featurematch and one ”part“, it can be inferred that (d) shows thecorrect particle and (c), one of the incorrect ones. It can beobserved in (c) the difference between two arrows whichhighlight shear between the matched feature point at t + 1and vertice of deformable wireframe. Only one feature matchand one ”part” is considered here just to make understandingsimple. Practically, we use all the matches and all the “parts”.

Fig. 3: (a) and (b) are the same images, which is the initial instantt. (c) and (d) are two of the different particles(estimated poses) atthe next instant t + 1, where pose in (d) is closer to the correctpose and (c) has incorrect pose.

Fig. 4: Peaks thresholded values of part detectors for wheels areshown figuratively for demonstration purpose. The particle in (a)would score lower than particle in (b).

C. detectionfit

detectionfit utilises output of local part detectorsas defined and used in [3]. For each part, we get part(log-)likelihood score at every pixel in the image from amulti-calss Random Forest as per equation (4) in the mainpaper. We sum over (log-)likelihood of all visible parts.Only the parts that are visible are considered because someparts would not be visible for given viewpoint due to self-occlusion. In order to handle this, the term oj(S(α)) inequation (4) in the main paper is a binary indicator functionfor visibility. Here part likelihood Lk is normalized bybackground likelihood Lb so that it discriminates the positiveclass from its background the most, following [1]. Finallythe likelihood is re-normalized to the number of visibleterms.

This term is used in optimizing shape as well as pose.As shown in equation (7) in the main paper, normalizedscore considering all visible parts is used for scoring whileoptimizing for shape. While in case of optimizing pose,likelihood of wheels is given quite a heavy weightage asthe peaks there is much more prominent than all other parts.Such an example is shown here in Figure 5.

Fig. 5: Yellow part in this figure shows thresholded output of partdetector for wheels.

D. photofit

photofit is used for estimating pose only when car isfar away from the camear and hence is used when heightof the car in image becomes less than 50 pixels. We usenormalized 2D cross correlation of each particle with thecorrect one in previous frame as the measure. As mentionedin the paper, we leverage on the fact that distant objectsmore or less undergo affine transformations of their texturedsurfaces, and hence the immediate texture surrounding thecorners of the wireframe models projection tend to remainintact.

III. MULTI-VIEW STOCHASTIC HILL CLIMBING

This algorithm can be seen as multi-view extension ofthe objective function presented in [3]. Apart from a 2Dbounding box and a coarse pose of car, this algorithm takespose estimated from the term boxfit as inputs and it optimizesfor shape of the car as well as gives accurate pose in the firstimage.

As shown in Algorithm-1 in the main paper, we generate250 particles with shape and pose parameters similar to

[3], except the fact that the particles in [3] don’t containinformation about loacalization of car in 3D. For everysuch particle we generate 400 candidates following a normaldistribution. For each such candidate we perform scoringand candidate with the best score is saved. Likewise, theparticle with the best score is saved as well. We perform 20such iterations and finally the particle with the best score ischosen.

Output of local part detectors as mentioned in sectionII-C is used for scoring. Each visible part of deformablewireframe is projected on the image and part likelihoodscore of corresponding part at that corresponding pixel isconsidered and sum of all the visible parts is considered asthe score for that candidate.

REFERENCES

[1] J. A.-C. A. S. L. G. F. M.-N. M. Villamizar, H. Grabner. Efficient 3DObject Detection using Multiple Pose-Specific Classifiers. In BMVC.

[2] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow:Large displacement optical flow with deep matching. In IEEE Intena-tional Conference on Computer Vision (ICCV), Sydney, Australia, Dec.2013.

[3] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3drepresentations for object modeling and recognition. IEEE Transactionson Pattern Analysis and Machine Intelligence, 35(11):2608–2623, 2013.

monocular reconstruction of vehicles: combining...

Documents