
In-Situ 3D Indoor Modeler with a Camera and Self-Contained Sensors

Tomoya Ishikawa 1, Kalaivani Thangamani 1,2, Masakatsu Kourogi 1, Andrew P. Gee 3, Walterio Mayol-Cuevas 3, Keechul Jung 4 and Takeshi Kurata 1

1 National Institute of Advanced Industrial Science and Technology, Japan
2 University of Tsukuba, Japan
3 University of Bristol, United Kingdom
4 Soongsil University, Korea

[email protected]

Abstract. We propose a 3D modeler that effectively supports in-situ indoor modeling. The modeler allows a user to easily create models from a single photo through interaction techniques that take advantage of the characteristics of indoor spaces, together with visualization techniques. To integrate the resulting models, the modeler provides automatic integration functions based on Visual SLAM and pedestrian dead-reckoning (PDR), as well as interactive tools for correcting the result. Moreover, to prevent a shortage of texture images for the models, our modeler automatically searches the 3D models created by the user for un-textured regions and intuitively visualizes the shooting positions from which photos of those regions should be taken. These functions enable the user to easily create photorealistic indoor 3D models with sufficient textures on the fly.

Keywords: 3D indoor modeling, Mixed reality, Virtualized object, Visual SLAM, Pedestrian dead-reckoning, Self-contained sensor

1 Introduction

Virtualized real objects made from photos allow virtual environments to enhance reality. This reduces the gap between the real and virtual worlds for a number of applications such as pre-visualization for online furniture shopping, walk-through simulation, and so on. In particular, the recent establishment of self-localisation methods [1] for indoor environments has prompted attempts in plants and offices to cut down unnecessary human movement and to predict unsafe behavior based on the human traffic lines estimated by these methods. For analyzing such data through visualization, photorealistic indoor 3D models made from real environments are quite useful. In addition, 3D models have come into use in navigation systems [2] not only for outdoor but also for indoor environments. In such systems, 3D models that closely resemble the real world are expected to help users intuitively understand their position and direction. However, creating photorealistic indoor 3D models is still a difficult task for anyone but professionals. In this research, we propose an interactive indoor 3D modeler that enables a user to create 3D models effectively and intuitively, thereby augmenting the reality of the applications described above.

Published in: Proc. 13th International Conference on Human-Computer Interaction (HCII 2009), San Diego, CA, USA, LNCS 5622, pp. 454-464 (2009).

In our proposed modeler, the user creates local models individually from input photos captured at different positions, as the unit of the modeling process. The local models are effectively integrated into a global model by using Visual SLAM [3], which can quickly create “sparse” maps of landmarks from video sequences, pedestrian dead-reckoning (PDR) [1] with self-contained sensors, and simple user interaction techniques. Modeling from a single photo allows the user to easily create photorealistic models texture-mapped with high-resolution, high-quality photos, in contrast to video sequences, which often contain motion blur and are generally of lower quality. On the other hand, video sequences are suitable for capturing wide areas in a short time. Our modeler uses these complementary properties for integrating local models.

In order to create indoor 3D models easily from only a single photo, our modeler utilizes interaction techniques that take advantage of the characteristics of indoor environments and the geometric constraints from a photo, together with visualization techniques [4]. This makes in-situ modeling feasible. For stable and accurate integration of the created local models, our modeler provides a two-stage registration function consisting of automatic functions and interaction techniques. Furthermore, the modeler helps the user create more complete models through automatic un-textured-region detection and view recommendation for capturing those regions.

This paper is organized as follows. Section 2 describes related work, and Sections 3, 4, and 5 present the overview of our proposed modeler, local modeling from a single photo, and global modeling for integrating local models, respectively. Finally, in Section 6, conclusions and future prospects are summarized.

2 Related Work

3D modeling methods based on photos can roughly be classified into two types. One is automatic modeling methods, which reconstruct 3D models without interaction techniques. The other is manual or semi-automatic modeling methods that rely on interaction techniques.

A state-of-the-art automatic modeling method has been proposed by Goesele et al. [5]. Their method reconstructs 3D models by using Structure-from-Motion (SfM) [6], which estimates camera parameters and 3D structural information of a scene, together with stereo matching to obtain dense 3D shapes from photos. In stereo methods, the scene objects have to be captured from many different viewpoints with overlapping regions in order to create accurate 3D models. Therefore, capturing enough photos or video sequences of indoor environments, which require inside-out video acquisition, is time-consuming. In addition, the computational cost becomes high when the video sequences are long, and the accuracy often does not meet practical needs.

Manual or semi-automatic modeling methods can produce high-quality models by taking advantage of the user’s knowledge. Google SketchUp [7] provides sketch interfaces on photos for creating 3D models, but the photos are used only for matching photos and 3D models. The system proposed by Oh et al. [8] utilizes geometric information from an input photo to constrain 3D models to lines of sight (LoS) while the user models and edits them. However, this system requires a large amount of time to divide the photo into regions.

In automatic methods based on SfM, all modeling processes break down when the estimation of correspondences among photos fails. To compensate for this weakness, Debevec et al. [9] proposed a semi-automatic method that achieves stable SfM and creates models consisting of basic primitives by manually adding correspondences between edges on the 3D primitives and edges in the images. In this method, however, target objects have to be approximated by the pre-determined basic primitives.

Sinha et al. [10] and van den Hengel et al. [11] have proposed interactive 3D modelers that use a sparse map and camera parameters estimated by SfM. These systems rely heavily on the SfM output to reduce the manpower needed for modeling, under the assumption that SfM can estimate all parameters successfully. Accordingly, when SfM fails, the entire modeling process breaks down. Furthermore, if the created models have critical un-textured regions, the user has to re-visit the site to capture texture images of those regions again.

One way to prevent such a shortage of texture images is to realize in-situ modeling. Along these lines, Neubert et al. [12] and Bunnum and Mayol-Cuevas [13] have proposed 3D modelers that can effectively and quickly create 3D models near target objects. However, the created models are simple wireframe models intended for tracking the objects, so they are not suitable for our target applications.

3 Overview of 3D Indoor Modeler

Fig. 1 shows the flowchart of our proposed modeler. As pre-processing, sparse maps in a global coordinate system are created by Visual SLAM and PDR to facilitate the integration process described below. Note that we assume the intrinsic camera parameters have been estimated by conventional camera calibration methods.

Then, the user takes a photo, creates a local model from the photo, and integrates it into the global coordinate system, iterating this process to build the global model of the whole indoor environment. The local models are created in local coordinate systems estimated from vanishing points on each input photo. In these local coordinate systems, the user can effectively create 3D models with interaction techniques that exploit the characteristics of indoor environments. Furthermore, during local modeling, the user can easily comprehend the shapes of the models being created through viewpoint changes, real-time mixed mapping of projective texture mapping (PTM) and depth mapping, and a smart secondary view that is adaptively controlled. For integrating local models, our modeler estimates the transform parameters between the local and global coordinate systems by means of the sparse maps of landmarks generated in pre-processing and the results of image-feature matching, and it also provides interactive tools for more stable parameter estimation. These functions enable the user to integrate models robustly without massive time consumption. After the integration, the modeler automatically detects un-textured regions in the global model and displays the regions, a recommended viewpoint for taking a texture image of the regions, and the position and direction of the user based on PDR, so that a more complete model can be created. With these supportive functions, the user is able to create indoor 3D models effectively on the fly.

[Fig. 1 flowchart: preprocessing (wide-ranging video capture → sparse mapping by Visual SLAM & PDR → output of sparse maps & camera parameters), followed by iterative 3D modeling (shooting a photo → transform-parameter estimation → interactive 3D modeling & checking → semi-automatic integration of local & global models → automatic detection of un-textured regions → view recommendation while the global model is incomplete → output of the global 3D model).]

Fig. 1. Flowchart of our proposed modeler.

4 Local Modeling from a Single Photo

4.1 Transform-Parameter Estimation

In indoor spaces, floors, walls, and furniture are typically arranged parallel or perpendicular to each other. These characteristics facilitate modeling: an orthogonal coordinate system can be fitted to the floors and walls that occupy large areas of a photo. Our proposed modeler utilizes these characteristics and estimates the transformation parameters between the local coordinate system and the camera coordinate system through simple, CV-supported user interactions.

A local coordinate system can be constructed by selecting two pairs of lines that are parallel in the actual 3D room. The modeler first executes the Hough transform to detect lines in the photo and then displays them to the user (Fig. 2-(a,b)). By clicking the displayed lines, the user provides pairs of parallel lines to the modeler. The 2D intersection points of the selected lines are the vanishing points of the photo. From the two vanishing points {e1, e2} and the focal length f of the camera, the rotation matrix R between the local coordinate system and the camera coordinate system is given by the following equation.

R = [\,\mathbf{r}_1 \;\; \mathbf{r}_2 \;\; \mathbf{r}_1 \times \mathbf{r}_2\,], \quad \mathbf{r}_i = \frac{(x_i,\, y_i,\, f)^T}{\|(x_i,\, y_i,\, f)^T\|}, \quad \mathbf{e}_i = (x_i,\, y_i)^T \;\; (i = 1, 2).

After estimating R, the origin of the ground plane, which corresponds to the x-y plane of the local coordinate system, is set by the user. This manipulation allows the modeler to determine the translation vector from the local coordinate system to the camera coordinate system. Moreover, the ground plane can be used to place the local model created from a photo into the global coordinate system under the assumption that the ground planes of both coordinate systems lie on the same plane. When the ground regions are not captured in a photo, the user sets a plane parallel to the ground plane instead.
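As a minimal numerical sketch (not part of the original implementation), the rotation in the equation above can be computed as follows. The function name and the example values are hypothetical, and the vanishing points are assumed to be expressed relative to the principal point.

```python
import numpy as np

def rotation_from_vanishing_points(e1, e2, f):
    """Build the local-to-camera rotation R of the equation above from two
    vanishing points e1, e2 (pixel coordinates relative to the principal
    point) and the focal length f in pixels."""
    r1 = np.array([e1[0], e1[1], f], dtype=float)
    r1 /= np.linalg.norm(r1)
    r2 = np.array([e2[0], e2[1], f], dtype=float)
    r2 /= np.linalg.norm(r2)
    r3 = np.cross(r1, r2)
    r3 /= np.linalg.norm(r3)
    # With noisy, user-selected line pairs, r1 and r2 are only approximately
    # orthogonal, so R may need re-orthonormalization (e.g. via SVD).
    return np.column_stack((r1, r2, r3))

# Hypothetical vanishing points from two user-selected pairs of parallel
# lines, and a focal length of 760 pixels.
R = rotation_from_vanishing_points((412.0, -35.0), (-980.0, 18.0), 760.0)
```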




Fig. 2. Transform-parameter estimation from single photo. (a): Input photo, (b): Detected lines by Hough transform, (c): Setting ground plane by user-interaction.

4.2 Interactive tools

Assuming that each object in an indoor photo can be modeled with a set of quadrangular and freeform planes, our modeler provides the user with two types of tools for creating planes.

Quadrangular tool: creates a 3D quadrangular plane by giving the 3D coordinates of the opposing corners through mouse clicks. This tool is suitable for simple objects such as floors, walls, tables, and shelves.

Freeform tool: creates a 3D freeform plane by giving a set of 3D points lying on the contour through repeated mouse clicks. This tool is used for more complex objects.

For both tools, the depth of the first established point can be obtained by intersecting the line of sight through the clicked point on the photo with the plane nearest to the optical center of the photo, if such an intersection exists. From the initial viewpoint, which corresponds to the optical center of the photo, the user can easily understand the correspondence between the photo and the model being created. In particular, with the freeform tool, setting contour points on 3D planes is the same interaction as 2D interaction with the photo, so the user can create models intuitively.
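A sketch of this ray-plane intersection is given below. The function name and example values are hypothetical; coordinates are assumed to be in the camera frame, with the clicked pixel given relative to the principal point.

```python
import numpy as np

def first_point_on_nearest_plane(click_xy, f, plane_point, plane_normal):
    """Intersect the line of sight through a clicked pixel (given relative
    to the principal point, focal length f in pixels) with a plane defined
    by a point and a normal in camera coordinates. Returns the 3D point,
    whose depth seeds the first vertex of the new plane, or None if the
    ray misses the plane or the intersection lies behind the camera."""
    ray = np.array([click_xy[0], click_xy[1], f], dtype=float)
    ray /= np.linalg.norm(ray)
    denom = float(np.dot(plane_normal, ray))
    if abs(denom) < 1e-9:
        return None                      # line of sight parallel to the plane
    t = float(np.dot(plane_normal, plane_point)) / denom
    if t <= 0.0:
        return None                      # intersection behind the camera
    return t * ray

# Hypothetical example: clicked pixel (120, -45), f = 760 px, intersecting a
# floor plane already created by the user (all values illustrative).
p = first_point_on_nearest_plane((120.0, -45.0), 760.0,
                                 plane_point=np.array([0.0, 1.2, 3.0]),
                                 plane_normal=np.array([0.0, -1.0, 0.0]))
```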

During these interactions, the normal of each plane can be toggled among several default orientations, such as those of the x-y, y-z, and z-x planes. This function is especially effective in man-made indoor environments. In addition, the user can create models with the help of the real-time mixed mapping of PTM and depth mapping described below.

The created models can be translated, deformed, and deleted. For translation and deformation, the view-volume constraint lets the user control the depth and the normal vector without changing the 2D shapes projected onto the input photo (Fig. 3).

Fig. 3. Normal manipulation with geometric constraint (red-colored plane: plane being manipulated, green line: view-volume).
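The view-volume constraint can be illustrated with the following sketch (not the modeler’s actual code): scaling all vertices of a plane by a common factor along their lines of sight changes the plane’s depth while leaving its projection on the input photo unchanged.

```python
import numpy as np

def slide_plane_along_view_volume(vertices_cam, scale):
    """Move a plane's vertices along their lines of sight from the optical
    center (origin of the camera frame). A common scale changes the depth
    of the plane but leaves its 2D projection on the input photo unchanged;
    choosing a different scale per vertex along its own line of sight
    changes the normal instead, again without altering the projection."""
    return [scale * np.asarray(v, dtype=float) for v in vertices_cam]

# Hypothetical quadrangular plane 2 m away, pushed 25% further back.
quad = [np.array([-0.5, -0.3, 2.0]), np.array([0.5, -0.3, 2.0]),
        np.array([0.5, 0.3, 2.0]), np.array([-0.5, 0.3, 2.0])]
quad_far = slide_plane_along_view_volume(quad, 1.25)
```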


4.3 Visualization for Checking 3D Model

Texture-and-Depth Representation. The proposed modeler provides three types of texture-and-depth presentation modes to the user as follows (Fig. 4).

Projective texture mapping (PTM): re-projects the texture in the photo onto 3D models and shows the correspondence between the shapes of the models and the textures.

Depth mapping: displays the depth from the viewpoint to the models as a gray-scale view image and shows the shapes of the model clearly.

Mixed mapping: displays the models by mixing PTM and depth mapping and shows a more shape-enhanced view image compared with PTM.

These presentation modes can be rendered by the GPU in real time, not only while viewing the models but also while creating and editing them. They are therefore effective for confirming the shapes of models being created.

It is often difficult for the user to confirm the shapes of models from the initial viewpoint using PTM alone. In such cases, depth mapping or mixed mapping provides good clues for confirming the shapes, finding missing planes, and adjusting depths.

Fig. 4. Examples of PTM (left), depth mapping (center), and mixed mapping (right).
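A rough CPU sketch of the mixed-mapping blend is shown below; the modeler itself renders these modes on the GPU in real time, and the blending weight and depth normalization here are assumptions.

```python
import numpy as np

def mixed_mapping(ptm_rgb, depth, alpha=0.5):
    """Blend a projective-texture-mapped rendering with a normalized
    gray-scale depth image to obtain a shape-enhanced view.
    ptm_rgb: HxWx3 float image in [0, 1]; depth: HxW float depth buffer
    (background pixels would need masking, omitted here); alpha: texture
    weight, an assumed parameter."""
    d = depth.astype(float)
    d = (d - d.min()) / max(float(d.max() - d.min()), 1e-9)   # normalize
    gray = np.repeat((1.0 - d)[:, :, None], 3, axis=2)        # near = bright
    return alpha * ptm_rgb + (1.0 - alpha) * gray
```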

Smart Secondary View. To make the shapes of models easy to understand while they are being constructed, our modeler displays not only a primary view but also a secondary view (Fig. 5). This simultaneous presentation helps the user intuitively create and confirm the models.

We define the criteria for determining the secondary-view parameters as follows.
1. Update frequency: Viewing parameters should not be changed frequently.
2. Point visibility: The next point to be created (corresponding to the mouse cursor) must not be occluded by the other planes.
3. Front-side visibility: The view must not show the back side of the target plane.
4. Parallelism: The view should be parallel to the target plane.
5. FoV difference: The view should have a wide field of view (FoV) when the primary view has a narrow FoV, and vice versa.

The modeler searches for the parameters of the secondary view based on the above criteria. For a real-time search of the viewing parameters, the parameters are sampled coarsely.
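One possible way to turn these criteria into a scoring function is sketched below. The paper lists the criteria but not a concrete cost, so the individual terms, weights, and function names are assumptions.

```python
import numpy as np

def secondary_view_cost(cand, prev, primary_fov, cursor_occluded,
                        plane_normal, plane_center, w=(1.0, 1.0, 1.0)):
    """Cost of one candidate secondary view under the five criteria above.
    cand / prev: dicts with 'position' (3-vector), 'direction' (unit vector)
    and 'fov' (radians); plane_normal / plane_center describe the plane
    being created. Lower cost is better."""
    # 2. Point visibility: the point under the mouse cursor must be visible.
    if cursor_occluded:
        return np.inf
    # 3. Front-side visibility: reject views on the back side of the plane.
    if np.dot(cand['position'] - plane_center, plane_normal) <= 0.0:
        return np.inf
    # 1. Update frequency: penalize large changes from the current view.
    c_update = np.linalg.norm(cand['position'] - prev['position'])
    # 4. Parallelism: image plane parallel to the target plane, i.e. view
    #    direction opposite to the plane normal.
    c_parallel = 1.0 + float(np.dot(cand['direction'], plane_normal))
    # 5. FoV difference: reward a FoV complementary to the primary view.
    c_fov = -abs(cand['fov'] - primary_fov)
    return w[0] * c_update + w[1] * c_parallel + w[2] * c_fov

# The modeler samples viewing parameters coarsely and keeps the candidate
# with the minimum cost (sampling loop omitted).
```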


Fig. 5. Close-up of secondary view (left) and primary view (right).

5 Global Modeling from Local Models

5.1 Sparse Mapping Using Visual SLAM and Self-Contained Sensors

Video sequences are suitable for capturing wide areas in a short time compared with photos. Our modeler generates sparse maps of the indoor environment, each consisting of a 3D point cloud, by using Visual SLAM [3] on video sequences and PDR with self-contained sensors [1].

SfM generally has a high computational cost and requires a long calculation time to estimate accurate camera motion parameters and a map. Consequently, for smooth in-situ modeling, our modeler uses Visual SLAM, which can estimate camera motion parameters and a map simultaneously and quickly, for the sparse mapping. Furthermore, the user’s position and direction measured by PDR can be used to set the positions and directions of photos and video sequences in the global coordinate system and the scale of the maps, by carrying out PDR simultaneously with Visual SLAM.
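One simple way to recover the metric scale mentioned above is to compare the lengths of the simultaneously recorded SLAM and PDR paths. The paper does not specify an exact formula, so the path-length ratio and the function name below are assumptions.

```python
import numpy as np

def map_scale_from_pdr(slam_positions, pdr_positions):
    """Estimate the metric scale of a Visual SLAM map by comparing the
    length of the SLAM camera path (map units) with the length of the
    simultaneously recorded PDR path (meters)."""
    def path_length(p):
        p = np.asarray(p, dtype=float)
        return float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())
    slam_len = path_length(slam_positions)
    if slam_len < 1e-9:
        raise ValueError("camera path too short to estimate scale")
    return path_length(pdr_positions) / slam_len
```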

When the modeler handles multiple maps, they are placed in the global coordinate system based on measurements from the self-contained sensors. The global coordinate system is configured so that the Z axis corresponds to the upward vertical direction and the X-Y plane to the ground plane. Adjustments of the rotation, translation, and scaling from the initial parameters estimated by PDR can be made with interactive tools. Fig. 6 shows two maps placed in the global coordinate system together with the camera paths. These sparse maps are used by the semi-automatic functions that integrate local and global models.

Fig. 6. Sparse maps by Visual SLAM and PDR in global coordinate system.
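The placement itself can be illustrated as a similarity transform whose initial parameters come from PDR and which the user may then adjust. The parameterization below (yaw about the vertical Z axis, 3D translation, uniform scale) and the function name are assumptions consistent with the description above.

```python
import numpy as np

def place_map_in_global(points_local, yaw, translation, scale):
    """Apply a similarity transform to a Visual SLAM map so that it sits in
    the global frame (Z up, X-Y plane on the ground). The initial yaw,
    translation, and scale would come from PDR and can then be adjusted
    interactively."""
    c, s = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[c, -s, 0.0],
                   [s,  c, 0.0],
                   [0.0, 0.0, 1.0]])
    pts = np.asarray(points_local, dtype=float)
    return scale * pts @ Rz.T + np.asarray(translation, dtype=float)

# Hypothetical example: rotate a map 30 degrees about the vertical axis,
# shift it to the corridor entrance, and scale it to meters.
pts_global = place_map_in_global(np.random.rand(100, 3),
                                 yaw=np.deg2rad(30.0),
                                 translation=(4.2, -1.0, 0.0),
                                 scale=1.8)
```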


5.2 Semi-Automatic Integration of Local and Global Models

After a local model is created from a photo (Section 4), it is integrated into the global model. The integration process consists of automatic functions based on Visual SLAM, PDR, and image-feature matching, and interactive tools with which the user manually supplies the information needed for integration when the automatic functions fail to estimate the transform parameters. This two-stage process enables the user to integrate local models into the global model effectively and reliably.

In the automatic functions, the modeler first performs relocalisation against the sparse maps, using the photo from which the local model was created, with the relocalisation engine of Visual SLAM [3]; it then obtains camera motion parameters and their uncertainties for estimating the transform parameters between the local and global coordinate systems. When relocalisation succeeds for multiple maps, the modeler selects the most reliable camera motion parameters according to the uncertainties and the position and direction from PDR. When relocalisation fails, the modeler uses the position and direction from PDR only. However, the camera motion parameters estimated by Visual SLAM and PDR are not accurate enough for registering local models.
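A possible selection rule among multiple relocalisation results is sketched below. The paper states only that the uncertainties and the PDR position and direction are taken into account, so the concrete cost and weights are assumptions.

```python
import numpy as np

def select_relocalisation(candidates, pdr_position, pdr_direction,
                          w_unc=1.0, w_pos=1.0, w_dir=1.0):
    """Pick the most reliable relocalisation result among several maps.
    Each candidate is a dict with 'position', 'direction' (unit vector) and
    'uncertainty' (a scalar summary such as the trace of the pose
    covariance). Returns None when relocalisation failed for every map, in
    which case the modeler falls back to PDR alone."""
    def cost(c):
        return (w_unc * c['uncertainty']
                + w_pos * np.linalg.norm(c['position'] - pdr_position)
                + w_dir * (1.0 - float(np.dot(c['direction'], pdr_direction))))
    if not candidates:
        return None
    return min(candidates, key=cost)
```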

For more accurate registration, the modeler performs image-feature matching between two photos: the one used to create the current local model and the one used to create the nearest other local model. In recent years, robust feature detectors and descriptors such as SIFT [14] and SURF [15] have been proposed, and they are quite useful for such image-feature matching. The 2D point correspondences are converted into 3D point correspondences by using the 3D local models created by the user. Then, the transform parameters between the local and global coordinate systems are robustly estimated using RANSAC [16].
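As an illustration, the estimation could be carried out with a closed-form similarity fit (in the style of Umeyama’s method) inside a RANSAC loop over the lifted 3D-3D correspondences. The paper does not specify the solver used inside RANSAC, so the threshold, iteration count, and function names below are assumptions.

```python
import numpy as np

def similarity_from_points(src, dst):
    """Closed-form similarity transform (scale s, rotation R, translation t)
    mapping src points onto dst points, Umeyama-style.
    src, dst: Nx3 arrays of corresponding 3D points (N >= 3, not collinear)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d
    U, sig, Vt = np.linalg.svd(D.T @ S)        # covariance (up to a factor N)
    E = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ E @ Vt
    s = np.trace(np.diag(sig) @ E) / (S ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_similarity(src, dst, iters=500, thresh=0.05, seed=0):
    """RANSAC over 3D-3D correspondences obtained by lifting SIFT/SURF
    matches onto the local models. The inlier threshold (meters) and the
    iteration count are assumed values."""
    rng = np.random.default_rng(seed)
    best, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        s, R, t = similarity_from_points(src[idx], dst[idx])
        err = np.linalg.norm(s * src @ R.T + t - dst, axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best_inliers:
            best, best_inliers = (s, R, t), inliers
    return best, best_inliers
```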

After the automatic functions, the user confirms whether the local model has been integrated correctly by viewing the displayed global model. Fig. 7 shows an example of automatic registration using the functions described above; the integrated local model, which overlaps a door of the global model, is accurately registered. When the accuracy of the integration is insufficient, the user can supply 2D corresponding points manually or adjust the transform parameters (translation, rotation, and scaling) interactively.

Fig. 7. Examples of automatic integration of a local model. Left: global model before integration, right: global model after integration.


5.3 Automatic Un-Textured-Region Detection and View Recommendation

To prevent shortages of texture images, our modeler automatically detects un-textured regions, that is, regions occluded in all photos, and presents both a recommended shooting position for capturing such a region and the user’s current position. This intuitively prompts the user to capture additional texture images.

Un-textured regions are detected from the integrated global model together with the intrinsic camera parameters and camera motion parameters of the photos in the global coordinate system. The automatic detector searches for planar regions occluded from all photo viewpoints and finds the dominant region with the highest density of occluded area by a 3D window search. The modeler then searches for an appropriate viewpoint from which to capture the region by evaluating a cost function, and recommends that viewpoint. The cost function is defined from the following criteria.
1. Observability: The viewpoint should capture the whole un-textured region.
2. Easiness: The viewpoint should be below the eye level of the user.
3. Parallelism: The view direction should be parallel to the un-textured region.
4. Distance: The viewpoint should be close to the un-textured region.

When the recommended viewpoint lies in an inapproachable position, the user can interactively choose another viewpoint rated by the same cost function.
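A possible form of this cost is sketched below. The paper defines the criteria but not the exact function, so the terms, weights, eye level, field of view, and function name are assumptions; the global frame is assumed to have Z pointing up, as in Section 5.1.

```python
import numpy as np

def shooting_view_cost(view_pos, view_dir, region_center, region_normal,
                       region_corners, eye_level=1.6, half_fov=np.deg2rad(30),
                       w=(1.0, 1.0, 1.0, 1.0)):
    """Cost of one candidate shooting viewpoint for an un-textured region,
    following the four criteria above. Lower cost is better."""
    # 1. Observability: every corner of the region should fall inside the FoV.
    dirs = np.asarray(region_corners, dtype=float) - view_pos
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    max_angle = np.max(np.arccos(np.clip(dirs @ view_dir, -1.0, 1.0)))
    c_obs = max(0.0, max_angle - half_fov)
    # 2. Easiness: the viewpoint should not be above the user's eye level.
    c_easy = max(0.0, view_pos[2] - eye_level)
    # 3. Parallelism: interpreted as facing the region head-on (image plane
    #    parallel to the region), i.e. view direction opposite its normal.
    c_par = 1.0 + float(np.dot(view_dir, region_normal))
    # 4. Distance: the viewpoint should be close to the region.
    c_dist = float(np.linalg.norm(region_center - view_pos))
    return w[0] * c_obs + w[1] * c_easy + w[2] * c_par + w[3] * c_dist
```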

After the recommended viewpoint is estimated, the user’s position and direction are intuitively presented on the monitor together with the global model, the un-textured region, and the recommended viewpoint (Fig. 8).

Fig. 8. (a) Appearance of the user, wearing self-contained sensors for PDR, confirming an un-textured region, (b) detected un-textured region, recommended viewpoint, and the user’s position and direction, and (c) updated model.

6 Conclusions

We have proposed an in-situ 3D modeler that supports efficient modeling of indoor environments. The modeler provides interaction techniques that take advantage of the characteristics inherent in indoor environments and the geometric constraints from a photo for easily creating 3D models, and it provides intuitive visualization for confirming the shapes of the created models. The created local models are integrated robustly by semi-automatic functions. Furthermore, presenting un-textured regions and recommended viewpoints for capturing them makes it possible for the user to create more complete models on the fly.


Our near-term work is to evaluate the effectiveness of the proposed interaction techniques, visualization, and supportive functions by modeling actual indoor environments. For more effective modeling, we plan to develop functions to optimize the global model where it overlaps with local models, to suggest initial 3D primitives by machine learning of the characteristics of indoor environments, and to inpaint small un-textured regions.

References

1. M. Kourogi, N. Sakata, T. Okuma, and T. Kurata, “Indoor/Outdoor Pedestrian Navigation with an Embedded GPS/RFID/Self-contained Sensor System”, In Proc. of 16th Int. Conf. on Artificial Reality and Telexistence (ICAT2006), pp.1310-1321, 2006.

2. T. Okuma, M. Kourogi, N. Sakata, and T. Kurata, “A Pilot User Study on 3-D Museum Guide with Route Recommendation Using a Sustainable Positioning System”, In Proc. of Int. Conf. on Control, Automation and Systems (ICCAS2007), pp.749-753, 2007.

3. A. P. Gee, D. Chekhlov, A. Calway, and W. Mayol-Cuevas, “Discovering Higher Level Structure in Visual SLAM”, IEEE Trans. on Robotics, vol.24, no.5, pp.980-990, 2008.

4. T. Ishikawa, K. Thangamani, T. Okuma, K. Jung, and T. Kurata, “Interactive Indoor 3D Modeling from a Single Photo with CV Support”, In Electronic Proc. of IWUVR2009, 2009.

5. M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz, “Multi-View Stereo for Community Photo Collections”, In Proc. of Int. Conf. on Computer Vision (ICCV2007), pp.14-20, 2007.

6. N. Snavely, S. M. Seitz, and R. Szeliski, “Modeling the World from Internet Photo Collections”, Int. Journal of Computer Vision, vol.80, pp.189-210, 2008.

7. “Google SketchUp”, http://sketchup.google.com/.

8. B. M. Oh, M. Chen, J. Dorsey, and F. Durand, “Image-Based Modeling and Photo Editing”, In Proc. of SIGGRAPH, pp.433-442, 2001.

9. P. E. Debevec, C. J. Taylor, and J. Malik, “Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach”, In Proc. of SIGGRAPH, pp.11-20, 1996.

10. S. N. Sinha, D. Steedly, R. Szeliski, M. Agrawala, and M. Pollefeys, “Interactive 3D Architectural Modeling from Unordered Photo Collections”, ACM Trans. on Graphics, vol.27, no.5, article.5, 2008.

11. A. van den Hengel, A. Dick, T. Thormahlen, B. Ward, and P. H. S. Torr, “VideoTrace: Rapid Interactive Scene Modelling from Video”, ACM Trans. on Graphics, vol.26, no.3, article.86, 2007.

12. J. Neubert, J. Pretlove, and T. Drummond, “Semi-Autonomous Generation of Appearance-based Edge Models from Image Sequences”, In Proc. of IEEE/ACM Int. Symp. on Mixed and Augmented Reality, pp.79-89, 2007.

13. P. Bunnum and W. Mayol-Cuevas, “OutlinAR: An Assisted Interactive Model Building System with Reduced Computational Effort”, In Proc. of IEEE/ACM Int. Symp. on Mixed and Augmented Reality, pp.61-64, 2008.

14. D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, Int. Journal of Computer Vision, vol.60, no.2, pp.91-110, 2004.

15. H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-Up Robust Features (SURF)”, Computer Vision and Image Understanding, vol.110, no.3, pp.346-359, 2008.

16. M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, Communications of the ACM, vol.24, no.6, pp.381-395, 1981.