
Indoor Augmented Reality using 3D Scene Reconstruction

Saumitro Dasgupta
Department of Computer Science

Stanford University
Stanford, CA

[email protected]

Abstract—This project report describes the design and implementation of a mobile application that works in conjunction with a structure and layout estimation module to generate an interactive model of an indoor scene from a series of images captured using the mobile device. In particular, methods for automatic frame selection, motion estimation, and image mosaicing are investigated.

I. INTRODUCTION

The field of 3D scene reconstruction is an active area of research where significant progress has been made in the past decade. However, existing implementations are usually limited by one or more of the following constraints:

• Require specialized imaging hardware (for instance, Google's Street View [1]).

• Require access to large datasets (for instance, as described in [5]).

• Provide the reconstruction only as a point cloud.

These constraints limit their use in consumer-level applications. For our project, we design and implement a mobile application that, in conjunction with a server-based processing system, is capable of reconstructing an interactive model of an indoor scene. It works on off-the-shelf, commercially available mobile devices (such as Apple's iPad and iPhone).

Fig. 1. The estimated layout overlaid on top of a static frame. The test model can be interactively positioned anywhere within the room by the user.

Fig. 2. Architectural Overview.

II. SYSTEM OVERVIEW

A. Architecture

Our system comprises the following modules:

• Mobile. Tasks handled include automatic video frame selection, realtime image mosaicing, 3D rendering, and scene augmentation with virtual models.

• Server. Tasks handled include structure-from-motion computation, camera parameter estimation, and layout estimation.

B. Operation

The process from start to finish can be grouped into three broad stages described below. A schematic overview is provided in Figure 2.

1) Acquisition: The user pans their mobile device, capturing a video of the room. The mobile application analyzes each frame and automatically determines the ones to preserve for further processing. In addition, it generates a mosaic from the captured frames to provide a visual cue to the user.

Once a sufficient number of frames have been captured, they are uploaded to the server to initiate the second stage of the process.


2) Estimation: The uploaded frames are fed into a structure-from-motion (SfM) estimator, which produces a point cloud approximation of the room. However, this point cloud is usually quite sparse. As such, it cannot be directly utilized for our AR purposes. Therefore, the next step is to feed the output of the SfM stage into a layout estimator, which attempts to find the best possible cube to parametrize the room (similar to the methods described in [4]). Once determined, the estimated camera parameters and room layout are sent to the mobile application.

3) Rendering and Interaction: The mobile application uses the estimated camera and layout parameters to overlay a perspective-projected cube on top of the captured scene images. The user can now interact with this reconstructed scene by inserting 3D models and images.

For the remainder of this project report, the primary focus will be the image-based algorithms used in the mobile application layer. The structure from motion and layout estimation components are abstracted as black boxes.

III. CAMERA CALIBRATION

The methods described in the sections that follow (as well as the server-side processes) assume that the intrinsic parameters of the mobile device's camera are known. Therefore, as a preliminary step, we first perform camera calibration for our device. The standard chessboard pattern is used for estimating the camera intrinsics using a method based on [9] and [2]. The resulting intrinsics matrix has the form:

K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

The auto-focus functionality of the mobile device is dis-abled during the image acquisition phase.
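As a concrete illustration, the sketch below shows how such a chessboard-based calibration could be carried out with OpenCV's calib3d module. The board dimensions and square size are illustrative placeholders; the report does not specify the calibration target parameters actually used.

// Minimal chessboard calibration sketch (OpenCV). Board size and square size
// are illustrative assumptions, not values from the report.
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

cv::Mat calibrateIntrinsics(const std::vector<cv::Mat>& frames,
                            cv::Size boardSize = cv::Size(9, 6),
                            float squareSize = 0.025f) {
    // Reference 3D corner positions of the planar chessboard (z = 0).
    std::vector<cv::Point3f> objectCorners;
    for (int r = 0; r < boardSize.height; ++r)
        for (int c = 0; c < boardSize.width; ++c)
            objectCorners.emplace_back(c * squareSize, r * squareSize, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;

    for (const cv::Mat& frame : frames) {
        cv::Mat gray;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(gray, boardSize, corners)) {
            cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.001));
            imagePoints.push_back(corners);
            objectPoints.push_back(objectCorners);
        }
    }

    // K has the form [fx 0 cx; 0 fy cy; 0 0 1]; distortion coefficients are discarded here.
    cv::Mat K, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, frames.front().size(),
                        K, distCoeffs, rvecs, tvecs);
    return K;
}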

IV. AUTOMATIC FRAME SELECTION HEURISTIC

This section describes the first stage of the pipeline, where the user acquires a video of the room by panning the mobile device around a portion of the room. Each input frame It is processed by a selection function S, which returns a binary value ξt indicating whether the frame is to be preserved or not:

ξt = S(It, Rt),  ξt ∈ {0, 1}

The second argument of this function, Rt, is a reference frame against which It is compared, such that:

R_t = \begin{cases} I_0 & \text{if } t = 0 \\ I_x, \; x = \operatorname{argmax}_{x \in [0, t)} \left[ \xi_x \cdot x \right] & \text{otherwise} \end{cases}

To ensure the validity of Rt as defined above, we enforce S(I0, R0) = 1.

We implement our frame selection function S in multiple sequential stages. An additional consideration to keep in mind is that this function must operate in real time on consumer-grade mobile hardware.

TABLE I. PARAMETERS USED FOR ORB KEYPOINT DETECTION

Parameter                   Value
Number of features          400
Number of pyramid levels    3
Scaling factor              1.2

A. Keypoint detection

The first stage in the frame selection process is to detect a set of keypoints ft in the given frame It:

ft = Γ(It)

For most cases, a reasonable choice for Γ is SIFT or SURF. However, we found them to be too slow for use in our realtime processing pipeline. For this reason, we decided to use ORB (Oriented FAST and Rotated BRIEF) as our feature detector. ORB is two orders of magnitude faster than SIFT, but performs as well in many situations [8]. The parameters used for ORB (determined empirically) are described in Table I.
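A minimal configuration of the detector with the Table I parameters might look like the following sketch; it uses the current cv::ORB interface, which may differ slightly from the OpenCV version used in the original implementation, and the remaining arguments are left at OpenCV defaults.

// ORB detector configured with the Table I parameters: 400 features,
// 3 pyramid levels, scale factor 1.2.
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

cv::Ptr<cv::ORB> makeDetector() {
    return cv::ORB::create(/*nfeatures=*/400,
                           /*scaleFactor=*/1.2f,
                           /*nlevels=*/3);
}

// ft = Γ(It): detect keypoints and compute binary descriptors for a frame.
void detectFeatures(const cv::Mat& frameGray,
                    std::vector<cv::KeyPoint>& keypoints,
                    cv::Mat& descriptors) {
    static cv::Ptr<cv::ORB> orb = makeDetector();
    orb->detectAndCompute(frameGray, cv::noArray(), keypoints, descriptors);
}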

B. Feature Matching

Once we have our set of keypoints ft, we proceed to match them against the keypoints of the reference frame, Γ(Rt). For binary features such as ORB, an effective distance metric is the Hamming distance (denoted by H below) [7]. Since the Hamming distance can be computed very efficiently, it is feasible to use exhaustive search for our matching. For each keypoint p ∈ ft, we let (p, P(i)), P ∈ Γ(Rt), denote the match with the ith nearest keypoint. To establish our final set of matches, Mt, we select only those matches which satisfy the following ratio test (similar to the one described in Lowe's SIFT paper [6]):

Mt = {(p, P(1)) : H(p, P(1)) < Cm · H(p, P(2))},  ∀p ∈ ft

For our implementation we used Cm = 0.8.
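A sketch of this matching stage using OpenCV's brute-force Hamming matcher is shown below; the function and variable names are illustrative rather than taken from the report's code.

// Hamming-distance brute-force matching with a Lowe-style ratio test (Cm = 0.8).
// descQuery holds the ORB descriptors of the current frame ft, descRef those of
// the reference frame Γ(Rt).
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

std::vector<cv::DMatch> ratioTestMatch(const cv::Mat& descQuery,
                                       const cv::Mat& descRef,
                                       float ratio = 0.8f) {
    cv::BFMatcher matcher(cv::NORM_HAMMING);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descQuery, descRef, knn, 2);  // two nearest neighbours per keypoint

    std::vector<cv::DMatch> good;
    for (const auto& pair : knn) {
        // Keep a match only when the best distance clearly beats the second best.
        if (pair.size() == 2 && pair[0].distance < ratio * pair[1].distance)
            good.push_back(pair[0]);
    }
    return good;
}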

C. Model Validation

1) Minimum Match Count Constraint: In this stage, we use the correspondences found in the previous stage to determine whether the given frame is suitable for further processing. In order to be considered, a frame It must produce at least T matches in the previous stage; that is, we immediately reject any frame for which |Mt| < T. For our implementation, we used T = 30.

2) Minimum Disparity Constraint: For a given keypoint p, let x(p) denote its image coordinates. The following coordinate transformation is performed:

X(p) = K−1 · x̃(p)

where x̃(p) is the homogeneous representation of the image coordinates and K is the camera intrinsics matrix described in Section III. Next, the disparity for each matched pair is obtained using:

Dt = {d(p, P) : ∀(p, P) ∈ Mt}
d(p, P) = ||X(p) − X(P)||2

where ||·||2 represents the L2 norm. The median value of Dt is compared against a threshold Cd. If this median value is below Cd, we declare the disparity between Rt and It to be too small, and reject the frame. For our implementation, we use Cd = 32.
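The disparity check could be implemented along the following lines; the point containers mirror the matching sketch above and are assumptions rather than the report's actual data structures.

// Minimum-disparity check: back-project matched image points through K^{-1}
// and compare the median pairwise distance against Cd (Cd = 32 as in the report).
#include <opencv2/core.hpp>
#include <algorithm>
#include <vector>

bool hasSufficientDisparity(const std::vector<cv::Point2f>& ptsCur,
                            const std::vector<cv::Point2f>& ptsRef,
                            const cv::Matx33d& K, double Cd = 32.0) {
    cv::Matx33d Kinv = K.inv();
    std::vector<double> disparities;
    for (size_t i = 0; i < ptsCur.size(); ++i) {
        // Homogeneous image coordinates mapped by K^{-1}.
        cv::Vec3d Xp = Kinv * cv::Vec3d(ptsCur[i].x, ptsCur[i].y, 1.0);
        cv::Vec3d XP = Kinv * cv::Vec3d(ptsRef[i].x, ptsRef[i].y, 1.0);
        disparities.push_back(cv::norm(Xp - XP));  // L2 norm of the difference
    }
    if (disparities.empty()) return false;

    // Median of the disparity set Dt.
    std::nth_element(disparities.begin(),
                     disparities.begin() + disparities.size() / 2,
                     disparities.end());
    return disparities[disparities.size() / 2] >= Cd;
}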

3) Model Comparison: In this step, we calculate two transformations between the matched pairs:

1) A homography is estimated for correspondences in Mt using a RANSAC-based method. The corresponding set of inliers (as determined using RANSAC) is denoted by φt,h.

2) Similarly, the fundamental matrix is estimated and its set of inliers is denoted by φt,f.

Then, our final decision ξt is determined as follows:

\xi_t = \begin{cases} 1 & \text{if } |\phi_{t,f}| > C_n \cdot |\phi_{t,h}| \\ 0 & \text{otherwise} \end{cases}

where Cn is a constant factor. For our implementation, we use Cn = 1.2.
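A possible realization of this comparison with OpenCV's RANSAC estimators is sketched below; the RANSAC reprojection thresholds and confidence value are OpenCV defaults, not values stated in the report.

// Model-comparison sketch: estimate a homography and a fundamental matrix with
// RANSAC and accept the frame when the fundamental-matrix model has
// sufficiently more inliers (Cn = 1.2). Point vectors hold the matched
// coordinates from the previous stage.
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

bool acceptFrame(const std::vector<cv::Point2f>& ptsCur,
                 const std::vector<cv::Point2f>& ptsRef,
                 std::vector<uchar>& inliersF, double Cn = 1.2) {
    if (ptsCur.size() < 8) return false;  // not enough correspondences for either model

    std::vector<uchar> inliersH;
    cv::findHomography(ptsRef, ptsCur, cv::RANSAC, 3.0, inliersH);
    cv::findFundamentalMat(ptsRef, ptsCur, cv::FM_RANSAC, 3.0, 0.99, inliersF);

    int nH = cv::countNonZero(inliersH);
    int nF = cv::countNonZero(inliersF);
    // ξt = 1 only when the epipolar model clearly beats the homography model.
    return nF > Cn * nH;
}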

V. IMAGE MOSAICING

A. Requirements

As the user pans the camera around the room, we wish to generate a visualization that demonstrates the captured areas of the room. A reasonable choice for such a visualization is an image mosaic formed using the automatically selected frames (as described in Section IV). However, there are certain constraints which prevent us from using conventional mosaicing methods. These are as follows:

1) We wish to minimize camera rotation (a requirement of the layout estimator). This rules out methods dominated by rotation-based models (such as the one employed by Microsoft's PhotoSynth).

2) We reject frames which are likely to be related via a homography. This rules out simple warping-based methods.

3) Our mosaicing method must be capable of handling two degrees of freedom. This rules out uni-directional stitching methods (such as the ones used by most stock panorama applications).

4) Our mosaicing method must be an "online" algorithm, capable of incorporating unordered frames. The visualization must also incorporate the live video preview into the mosaic.

One approach that would satisfy the constraints above would be the following:

1) Estimate the linear motion of the camera relative to the last established reference.

2) Composite the next selected frame at the estimated image offset (see the sketch after this list).

3) Update the location of the live video preview based on the determined offset.
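Step 2 amounts to pasting each accepted frame into a larger canvas at the estimated offset, roughly as in the sketch below; the canvas handling and clipping are assumptions, since the report does not describe the compositing details.

// Minimal compositing sketch: paste a selected frame into a larger mosaic
// canvas at an integer pixel offset, clipped to the canvas bounds.
#include <opencv2/core.hpp>

void compositeFrame(cv::Mat& mosaic, const cv::Mat& frame, cv::Point offset) {
    // Destination rectangle, clipped to the canvas.
    cv::Rect dst(offset, frame.size());
    cv::Rect clipped = dst & cv::Rect(0, 0, mosaic.cols, mosaic.rows);
    if (clipped.empty()) return;

    // Copy the corresponding region of the frame into the mosaic.
    cv::Rect src(clipped.x - dst.x, clipped.y - dst.y, clipped.width, clipped.height);
    frame(src).copyTo(mosaic(clipped));
}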

The challenging part in the algorithm described above is determining the linear motion. In principle, the linear motion can be obtained by double-integrating the output of the device's accelerometer, after adjusting for the acceleration due to gravity g. However, the amplification of noise and the bias due to g render the position estimate unusable.

B. Image based translation estimation

We utilize the ORB keypoints generated during the frame selection process to estimate the translation. For each selected frame It, we construct a set of translation estimates using the RANSAC inliers determined in the stage described in Section IV-C3:

∆t = {x(p)− x(P ) : ∀(p, P ) ∈ φt,f}

The translation estimate for that frame is then determined as:

δt = median[∆t]
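Interpreting the median component-wise, this estimate can be computed as in the sketch below; the inlier mask layout follows the earlier model-comparison sketch and is an assumption rather than the report's exact representation.

// Per-frame translation estimate: component-wise median of the pixel
// displacements of the fundamental-matrix RANSAC inliers (Section IV-C3).
#include <opencv2/core.hpp>
#include <algorithm>
#include <vector>

cv::Point2f estimateTranslation(const std::vector<cv::Point2f>& ptsCur,
                                const std::vector<cv::Point2f>& ptsRef,
                                const std::vector<uchar>& inliersF) {
    std::vector<float> dx, dy;
    for (size_t i = 0; i < ptsCur.size(); ++i) {
        if (!inliersF[i]) continue;  // keep only the inlier set φt,f
        dx.push_back(ptsCur[i].x - ptsRef[i].x);
        dy.push_back(ptsCur[i].y - ptsRef[i].y);
    }
    if (dx.empty()) return {0.0f, 0.0f};  // no inliers: no usable estimate

    auto median = [](std::vector<float>& v) {
        std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
        return v[v.size() / 2];
    };
    return {median(dx), median(dy)};  // δt
}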

VI. INTERACTIVE VIRTUAL SCENE SETUP

Once the server completes the layout estimation, it provides the mobile client with the following values:

1) The vertices of the estimated cube that describes the room.

2) The estimated camera parameters for each frame.

3) The estimated rotation matrix R and translation vector t for each frame.

These parameters can now be used for configuring our rendering engine to match the scene. The rendering layer is overlaid on top of the static frame image. The user can now insert 3D models into the scene and move them around.

These models can interact with the scene by automatically snapping to the nearest plane. This is currently accomplished using ray-plane intersection to determine the appropriate plane for positioning.
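The snapping logic itself lives in the Unity (C#) scripts, but the underlying geometry is plain ray-plane intersection; the C++ sketch below is given only for consistency with the other listings, and the plane representation is chosen for illustration.

// Ray-plane intersection sketch for the snapping step: cast a ray from the
// camera through the selected screen point and find the nearest plane
// (cube face) it hits. A plane is given by a point p0 and a unit normal n.
#include <opencv2/core.hpp>
#include <cmath>
#include <limits>
#include <vector>

struct Plane { cv::Vec3d p0, n; };

// Returns the parameter s >= 0 such that origin + s * dir lies on the plane,
// or a negative value when the ray is parallel to or points away from it.
double intersectRayPlane(const cv::Vec3d& origin, const cv::Vec3d& dir, const Plane& plane) {
    double denom = plane.n.dot(dir);
    if (std::abs(denom) < 1e-9) return -1.0;
    return plane.n.dot(plane.p0 - origin) / denom;
}

// Pick the closest plane hit by the ray, or -1 if none is hit.
int nearestPlane(const cv::Vec3d& origin, const cv::Vec3d& dir, const std::vector<Plane>& planes) {
    int best = -1;
    double bestS = std::numeric_limits<double>::max();
    for (size_t i = 0; i < planes.size(); ++i) {
        double s = intersectRayPlane(origin, dir, planes[i]);
        if (s >= 0.0 && s < bestS) { bestS = s; best = static_cast<int>(i); }
    }
    return best;
}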

VII. IMPLEMENTATION

A. External Dependencies

For image processing and computer-vision algorithms, our implementation makes extensive use of the OpenCV library [3]. Specifically, it is used for the following operations:

• ORB keypoint detection

• k-NN matching

• Homography estimation

• Fundamental matrix estimation

• RANSAC

For 3D rendering, we utilize the cross-platform game engine Unity3D. The associated scripts for the engine (for instance, those responsible for setting the appropriate scene transformations, detecting model-plane collision, etc.) are written in C#.

B. Platforms

The current implementation targets iOS-based devices, namely the iPhone and iPad. While the user interface is written in Objective-C, the core image-based algorithms are implemented as a cross-platform C++ module.

The rendering engine (and its associated scripts) can be exported to both iOS and Android without any modifications.


Fig. 3. Keypoint tracking: ORB performs well even in relatively textureless and uniform regions.

VIII. RESULTS

A. Automatic Frame Selection

1) Feature Selection and Tracking: We found ORB to perform well for consistent detection of keypoints. Figure 3 shows ORB keypoints overlaid on top of two shifted frames. Notice that ORB detects reasonable features which are consistent across frames.

We also found ORB to be sufficiently fast for use in our realtime pipeline.

2) Quality of Frames: The quality of the frames selected by our heuristic was measured by observing the output of the structure from motion (SfM) stage. If the SfM stage produces a reasonable point cloud output, we expect the layout estimation engine to perform correctly as well. Analyzing the SfM output also allows us to decouple any potential issues with the layout estimation from those with frame selection.

Using the method described above, we found that our frame selection works as expected. Figure 4 shows the output generated by the SfM layer, alongside the frames selected using our method. Observe that the point cloud reconstruction accurately captures the geometry of the room.

B. Image-based Motion Estimation and Mosaicing

The performance of our algorithm was evaluated by moving the device along multiple arbitrary paths. For a subset of these, the direction along each axis was also arbitrarily reversed mid-path. While these movements were primarily restricted to translation, this was not strictly enforced: minor rotations (as one would expect from an average user) were present.

1) Motion Estimation: We found that our algorithm was able to predict the linear motion of the device with reasonable accuracy in each case, with the following exceptions:

Fig. 4. Frames selected by our method and the corresponding point cloud generated using SfM.

• When the device was moved very rapidly, such that sufficient correspondences could not be established between frames.

• When the device was capturing an area without any trackable features. For instance, if the entire frame only included a uniform section of the wall with no texture.

In both of these cases, the tracking resumes as expected if the user moves the device back over the last tracked reference frame (as described in Section IV).

C. Image Mosaicing

We found that our realtime mosaicing algorithm worked as expected for most cases. Figure 5 shows a mosaic generated using our method. A drop-shadow is included for each composited frame to make it easily discernible. The yellow-bordered frame corresponds to the live video preview integrated into the mosaic.

We observed the following:

• Slight offsets from the ground truth can be observed for certain tiles. However, these are not particularly drastic; the overall composition provides a good estimate of the mapped area.

• Minor rotations are tolerated by the algorithm. These manifest as minor alignment errors.

• Significant rotation during camera movement results in drastic artifacts, as expected.

The rotation issues described above can be compensated for by tracking the relative rotation between frames using the device's gyroscope. However, since our upstream algorithms require minimal rotation for successful reconstruction, we decided to leave this unmodified. This allows the user to visually estimate the quality of the scene capture by observing the performance of our mosaicing.

Fig. 5. A mosaic generated using our algorithm. The yellow-bordered frame is the current frame.

IX. CONCLUSION

We developed a mobile application for augmented reality that works in conjunction with a layout estimator to produce an interactive 3D reconstruction of an indoor scene. We designed and demonstrated a heuristic for automatically selecting video frames suitable for performing structure from motion estimation. Finally, we demonstrated an image-based method for motion tracking that can be effectively used for constructing realtime mosaics.

X. ACKNOWLEDGMENTS

I would like to thank Prof. Silvio Savarese, Prof. Fei-Fei Li, and Dr. Roland Angst for their assistance with the project. I would also like to thank Prof. Bernd Girod and David Chen for their support throughout the course.

XI. APPENDIX

A. Work Breakdown

The iOS mobile client and its associated algorithms (frame selection, image mosaicing, etc.) were written by Saumitro Dasgupta1.

The Unity3D scripts were written by Matt Vitelli1 and Saumitro Dasgupta.

The server-side components (SfM, layout estimation) are based on an existing project by Sid Yingze Bao3 and Silvio Savarese1. It was extended for a CS231A class project by Saumitro Dasgupta, Ayesha Khwaja2, and Matt Vitelli.

1 Department of Computer Science, Stanford University
2 Department of Electrical Engineering, Stanford University
3 EECS Department, University of Michigan at Ann Arbor

REFERENCES

[1] ANGUELOV, D., DULONG, C., FILIP, D., FRUEH, C., LAFON, S., LYON, R., OGALE, A., VINCENT, L., AND WEAVER, J. Google Street View: Capturing the world at street level. Computer 43, 6 (2010), 32–38.

[2] BOUGUET, J.-Y. Camera Calibration Toolbox for Matlab.

[3] BRADSKI, G. The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000).

[4] FURLAN, A., MILLER, S., SORRENTI, D. G., FEI-FEI, L., AND SAVARESE, S. Free your camera: 3D indoor scene understanding from arbitrary camera motion. In British Machine Vision Conference (BMVC) (2013), p. 9.

[5] LI, Y., SNAVELY, N., HUTTENLOCHER, D., AND FUA, P. Worldwide pose estimation using 3D point clouds. In Computer Vision–ECCV 2012. Springer, 2012, pp. 15–29.

[6] LOWE, D. G. Object recognition from local scale-invariant features. In The Proceedings of the Seventh IEEE International Conference on Computer Vision (1999), vol. 2, IEEE, pp. 1150–1157.

[7] MUJA, M., AND LOWE, D. G. Fast matching of binary features. In Computer and Robot Vision (CRV), 2012 Ninth Conference on (2012), IEEE, pp. 404–410.

[8] RUBLEE, E., RABAUD, V., KONOLIGE, K., AND BRADSKI, G. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on (2011), IEEE, pp. 2564–2571.

[9] ZHANG, Z. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22, 11 (2000), 1330–1334.