

Active 3D Modeling via Online Multi-View Stereo

Soohwan Song, Daekyum Kim, and Sungho Jo

Abstract— Multi-view stereo (MVS) algorithms have been commonly used to model large-scale structures. When processing MVS, image acquisition is an important issue because the reconstruction quality depends heavily on the acquired images. Recently, an explore-then-exploit strategy has been used to acquire images for MVS. This method first constructs a coarse model by exploring an entire scene using a pre-allocated camera trajectory. Then, it rescans the regions left unreconstructed in the coarse model. However, this strategy is inefficient because the initial and rescanning trajectories frequently overlap. Furthermore, even given complete image coverage, MVS algorithms do not guarantee an accurate reconstruction result.

In this study, we propose a novel view path-planning method based on an online MVS system. This method aims to incrementally construct the target three-dimensional (3D) model in real time. View paths are continually planned based on online feedback from the partially constructed model. The obtained paths fully cover low-quality surfaces while maximizing the reconstruction performance of MVS. Experimental results demonstrate that the proposed method can construct high-quality 3D models in a single exploration trial, without any rescanning trial as in the explore-then-exploit method.

I. INTRODUCTION

Three-dimensional (3D) modeling of large-scale structures is an important and ongoing research issue [1]. To address this issue, multi-view stereo (MVS) algorithms [2] [3] are widely used because they can estimate wider depth ranges than stereo cameras or RGB-D sensors. MVS is an offline method that processes a collection of calibrated images in a batch to reconstruct the corresponding 3D model. The reconstruction quality of MVS greatly depends on the collected images [4] [5]. Therefore, it is important to acquire a set of images that fully covers a target structure to maximize the reconstruction performance of MVS.

In order to determine an optimal trajectory that generates the best 3D model in MVS, view path-planning algorithms are widely used. As a view path-planning method, an inspection-planning approach [6] [7] [8] is commonly employed when reconstructing a large-scale structure. Based on the assumption that a prior model of the target structure is known, the inspection approach aims to compute camera trajectories that provide complete surface coverage of the prior model. However, in most real-world cases, prior information about the target structure is unavailable. To address this issue, studies have proposed the explore-then-exploit method [1] [5] [9] [10].

*This work was supported by the National Research Foundation of Korea funded by the Ministry of Education (No. 2016R1D1A1B01013573) and by the Technology Innovation Program funded by the Ministry of Trade, Industry & Energy (MI, Korea) (No. 10070171).

Soohwan Song, Daekyum Kim, and Sungho Jo are with the School of Computing, KAIST, Daejeon 34141, Republic of Korea. {dramanet30, daekyum, shjo}@kaist.ac.kr

Fig. 1. 3D reconstruction results. (a) The explore-then-exploit method [9] with offline MVS [2]. It first constructs an initial coarse model from a default trajectory (in red) and then computes the coverage path (in blue). It takes about 11 hours in total (3 hours for the coarse model and 8 hours for the detailed reconstruction) to process all acquired images. (b) Our method: view-planning and online MVS system. It reconstructs a 3D scene in real time and completes the modeling process in one exploration trial.

This method first constructs a coarse model by exploring an entire scene using a fixed trajectory within a safe area. Then, based on the coarse model, it plans an inspection path for the entire model.

However, the modeling performance of the explore-then-exploit method can be degraded for several reasons. First, this method sometimes generates overlapping trajectories, which is time-inefficient because the camera needs to scan the same areas repeatedly. Second, even when the given images fully cover the target structure, MVS algorithms are not guaranteed to generate a complete and accurate reconstructed model due to textureless scenes, short baseline distances, and occlusions. Third, MVS algorithms usually take a long time to process the images, which makes the entire modeling process extremely slow.

To address these problems, we propose a novel view path-planning method based on an online MVS system. The proposed online MVS system extends monocular mapping algorithms [11] [12] [13], which focus only on local dense mapping, to the construction of a large-scale model. The proposed system handles the large amount of noise and outliers in the 3D data using several post-processing steps, including noise filtering and depth interpolation. Unlike the offline methods, our method incrementally constructs the large-scale 3D model using the surfel mapping method [14] in real time. This makes it possible to evaluate the completeness of the model by analyzing the quality of the reconstructed surface. The proposed method iteratively plans view paths using the reconstruction quality feedback. It determines the best views to acquire reference and source images using heuristic information of MVS. Based on the determined views, it provides an optimal camera trajectory that satisfies the following: i) it covers the low-quality surfaces in the current scene, and ii) it improves the stereo matching performance.



Fig. 2. Overall system architecture of the proposed active 3D modeling framework (see Section III for more details).

This planning method then constructs a complete and accurate 3D model in one exploration trial, without any rescanning trial as in the explore-then-exploit method (Fig. 1).

The contributions of this paper are summarized as follows: (i) Unlike existing approaches, we employ an online MVS system based on a monocular mapping algorithm to construct 3D models while addressing the view path-planning problem. (ii) We propose a view path-planning method that performs trajectory optimization and view selection to maximize the performance of MVS reconstruction. (iii) We empirically evaluate the proposed method in two simulated scenarios using a Micro Aerial Vehicle (MAV). The effectiveness and applicability of the proposed method are evaluated in comparison with existing methods.

II. RELATED WORK

The problem of computing optimal views during the construction of a 3D model is known as view planning or active vision [15] [16] [17]. Different types of view-planning methods have been developed for different reconstruction algorithms. Real-time mapping algorithms [18] [19] accumulate 3D data directly from an RGB-D sensor. Because these algorithms can evaluate modeling completeness online, exploration approaches are frequently used to plan view paths by identifying uncovered areas from the partially reconstructed model. Most approaches iteratively determine the next-best-view (NBV) [20] [21] or view path [22] for exploration planning by analyzing frontiers in a volumetric map. Song and Jo [23] proposed a surface-based exploration method, which simultaneously considers unexplored regions and reconstruction quality. Our work extends this method [23] by incorporating a trajectory optimization step to improve the performance of MVS.

Several methods [24] [25] have been proposed to model the target scene directly from a sparse point cloud using structure-from-motion (SfM). Hoppe et al. [24] proposed an online SfM framework that provides visual feedback of the reconstruction quality to a human operator. Haner and Heyden [25] proposed a covariance propagation method that determines a view sequence to minimize the reconstruction uncertainty of sequential SfM.

MVS algorithms have been used in large-scale structure modeling because they can estimate a wide depth range of the target scene. Some approaches [26] [27] acquired images for MVS reconstruction using simple lawn-mower or circular trajectories in a safe overhead area. However, these approaches do not guarantee complete coverage of the target structure. Therefore, an explore-then-exploit approach is generally adopted in 3D modeling systems based on an MVS algorithm [28] [9] [5] [10]. Roberts et al. [9] proposed modeling surface coverage by observing a hemispheric area around the surface. This method computes a coverage trajectory such that the camera observes surfaces from diverse viewpoints by maximizing the observed area of the hemispheres. Huang et al. [5] proposed a relatively fast MVS algorithm that reconstructs coarse 3D models. Their method iteratively determines the NBVs by analyzing the coverage of a tentative surface model.

Only a few methods [29] [30] [31] deal with an online MVS algorithm. Mendez et al. [29] [30] used an online dense stereo matching algorithm [32] based on deep learning. They proposed a next-best-stereo method that determines the best stereo pair to maximize stereo matching performance. Forster et al. [31] addressed a view-planning problem that acquires an informative motion trajectory for monocular dense depth estimation. These approaches concentrate only on local depth estimation, whereas global modeling is not considered. In contrast, the proposed method deals with not only the local trajectory for local mapping but also the global trajectory for constructing a global model.

III. SYSTEM OVERVIEW

Fig. 2 illustrates the overall system architecture of the proposed framework, which is composed of two modules: online 3D modeling and path planning. The 3D modeling module constructs 3D models online from an obtained image sequence. It first computes the camera poses with a SLAM system [33] and generates a depth map of a scene using a monocular mapping method [11]. The obtained depth maps are simultaneously integrated into a volumetric map M and a surface model F. The volumetric map [19] categorizes the environment into three states (unknown, free, and occupied), which is required for planning a collision-free path. The surface model is represented as a collection of dense point primitives, surfels, which is efficient for surface deformation and rendering.

The path-planning module computes the exploration paths to reconstruct a target structure. The module first plans a global path by analyzing the volumetric map in order to explore a large unknown area.


Fig. 3. Examples of a reconstructed point cloud of a scene by (a) initial depth, (b) fully-interpolated depth, and (c) partially-interpolated depth. (d) Delaunay triangulation result of a set of support depth points. The weighted least squares (WLS) filter is applied in planar areas (green triangles), excluding non-planar areas (red triangles).

Then, the module computes a local inspection path that covers the defectively reconstructed surfaces in the surface model. The planned local path is refined to maximize the performance of MVS reconstruction. The mobile robot constantly navigates along the planned path while constructing the 3D models of the target structure.

IV. ONLINE MULTI-VIEW STEREO

The online MVS system is based on a monocular dense mapping algorithm [11]. The system obtains images and the corresponding camera poses using a SLAM module. We use a keyframe-based SLAM method [33] that computes a camera pose by estimating sparse map points from selected keyframes. When a new keyframe is extracted, the image-pose pair is stored in a database, and the keyframe is set as a reference image for depth estimation. Unlike existing monocular mapping algorithms [11] [12], which use subsequent sequential images as source images for stereo matching, our method selects the best set of source images in an online and active manner to improve the depth estimation performance (see Section V.E for active image selection).

A. Depth Estimation

A depth map Dt of a reference image Iref is obtained by pixel-wise stereo matching, given a source image It at time t. The obtained depth maps D1, ..., DT are sequentially integrated using REMODE [11], a recursive Bayesian estimation approach. For each pixel, the algorithm recursively updates the mean depth d and its variance σ2, which follow a Gaussian distribution, and the inlier probability ρ. A depth estimate is considered converged when ρ > θinlier and σ2 < θvar. The algorithm then applies a regularization filter, a variant of total variation.
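To illustrate the per-pixel filter, the following minimal Python sketch keeps a Gaussian depth hypothesis and an inlier probability per pixel and tests the convergence condition. The Gaussian-only fusion and the fixed-rate ρ update are simplified stand-ins for the REMODE filter, and the threshold values are assumptions rather than values from the paper.

```python
# Minimal per-pixel depth filter in the spirit of REMODE [11]: each pixel keeps a
# Gaussian depth estimate (mean d, variance var) and an inlier probability rho.
# The real filter uses a Gaussian+uniform mixture; this is a simplified sketch.

THETA_INLIER = 0.7   # assumed convergence thresholds (not taken from the paper)
THETA_VAR = 0.01

class DepthHypothesis:
    def __init__(self, d0, var0, rho0=0.5):
        self.d, self.var, self.rho = d0, var0, rho0

    def update(self, d_meas, var_meas, is_inlier_likely=True):
        # Fuse the new stereo measurement with the current Gaussian estimate.
        k = self.var / (self.var + var_meas)
        self.d = self.d + k * (d_meas - self.d)
        self.var = (1.0 - k) * self.var
        # Crude inlier-probability update (placeholder for the Bayesian update).
        self.rho = 0.9 * self.rho + 0.1 * (1.0 if is_inlier_likely else 0.0)

    def converged(self):
        # A depth estimate is accepted once it is both confident and consistent.
        return self.rho > THETA_INLIER and self.var < THETA_VAR
```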

B. Depth Post-Processing

The depth maps estimated from REMODE have relatively more unconverged pixels (in textureless regions) and more outliers among the converged pixels than those from offline MVS methods [2] [3]. Therefore, we first remove the outlier depths by checking whether a depth point is supported by neighboring pixels, as proposed in [34]. Next, we interpolate the unconverged depth values while enforcing depth smoothness of the coplanar surfaces. We employ the fast weighted least squares (WLS) approach [35] for depth interpolation. WLS performs image-guided, edge-preserving interpolation by solving a global optimization problem.

Although the WLS method is usually effective in textureless or planar regions, it causes noise in regions near depth discontinuities and non-planar surfaces [36]. Furthermore, depth values could be interpolated into empty areas, such as the sky or distant blurry regions (Fig. 3b). To address this issue, we extract piecewise planar areas of the target structure and apply the WLS filter only to the planar areas.

First, we divide the entire image region into rectangular grids and determine a sparse set of support depth points by selecting the median depth point within each grid. Then, we compute a two-dimensional (2D) Delaunay triangulation of the pixel locations of the support depth points using the fast triangulation method [37]. Triangles whose 3D edges are longer than a certain threshold are eliminated from the triangular set. We segment the image into a set of triangular regions $T = [T_1, ..., T_K]$ according to the constructed triangular meshes. Each triangular region $T_k$ can be described by its plane parameters $\tau_k = (\tau_k^1, \tau_k^2, \tau_k^3) \in \mathbb{R}^3$; the depth at a pixel p is defined as $d_p = \tau_k^1 p_x + \tau_k^2 p_y + \tau_k^3$, where $p_x$ and $p_y$ denote p's x- and y-coordinates [38]. Given a depth map D′ interpolated by the WLS filter, we define the planarity of each triangular region $T_k$ as

$$f_{pl}(T_k) = \frac{1}{\#(T_k)} \sum_{p \in T_k} \mathbb{1}\left[\, |\tau_k^1 p_x + \tau_k^2 p_y + \tau_k^3 - d'_p| < d_{thr} \,\right] \qquad (1)$$

where $\#(T_k)$ is the number of pixels in $T_k$, $d_{thr}$ is the depth threshold, and $\mathbb{1}[\cdot]$ is the indicator function. This measure simply represents the ratio of interpolated depths that are consistent with the plane parameters $\tau_k$. If the ratio is higher than a threshold (0.95 in this study), the region is labeled as a planar area (Fig. 3d). Similar to the method in [36], the interpolated depth d′ is applied in the planar area, and the initial depth d is used in the non-planar area. Fig. 3c illustrates an example of the post-processed result.
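For illustration, the planarity measure of Eq. 1 can be computed as in the following sketch. The array layout and the value of d_thr are assumptions; the paper only fixes the 0.95 labeling threshold.

```python
import numpy as np

def planarity(tau, pixels, d_interp, d_thr=0.05):
    """Ratio of interpolated depths consistent with plane tau = (t1, t2, t3) (Eq. 1).

    pixels:   (N, 2) array of (px, py) pixel coordinates inside triangle T_k
    d_interp: (N,) WLS-interpolated depths d'_p at those pixels
    d_thr is an assumed threshold value; the paper labels a region planar
    when this ratio exceeds 0.95.
    """
    t1, t2, t3 = tau
    plane_depth = t1 * pixels[:, 0] + t2 * pixels[:, 1] + t3
    consistent = np.abs(plane_depth - d_interp) < d_thr
    return consistent.mean()

# A region is treated as planar (WLS result kept) when planarity(...) > 0.95;
# otherwise the initial, non-interpolated depths are retained.
```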

C. Surfel Mapping

The surface model F, which represents a reconstructed surface, is composed of 3D surfels, where each surfel has the following attributes: a 3D position, normal, color, weight, and radius. We employ the surfel initialization and surfel fusion method described in [14]. An initial weight w is directly mapped from a depth variance $\sigma^2$ as $w = \sigma_{max} - \sqrt{\sigma^2}$, where $\sigma_{max}$ is a user-defined maximum standard deviation (1.0 in this study). The weight of an interpolated depth is set to a constant $w_{const} = \sigma_{max} - \sqrt{\theta_{var}}$. As in [14], we label surfels that have not been fused for a period as inactive and filter out low-weight inactive surfels.
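In code, the weight assignment reduces to the following sketch ($\sigma_{max}$ = 1.0 follows the paper; the value of $\theta_{var}$ is an assumption):

```python
import math

SIGMA_MAX = 1.0      # user-defined maximum standard deviation (as in the paper)
THETA_VAR = 0.01     # assumed depth-variance convergence threshold

def surfel_weight(depth_var):
    # Initial weight of a surfel created from a converged depth estimate.
    return SIGMA_MAX - math.sqrt(depth_var)

# Constant weight assigned to surfels created from interpolated depths.
W_CONST = SIGMA_MAX - math.sqrt(THETA_VAR)
```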

D. Processing Loop Closing

When the SLAM module performs loop closing, our method deforms the surface model and the volumetric map according to the updated pose graph.


Fig. 4. Overview of the proposed path-planning method, which is composed of three steps. (a) Global path planning: a global path is determined by computing the global coverage of unexplored regions. (b) Target surface extraction: target surfaces (gray arrow points) are extracted by evaluating the reconstruction quality (Poisson surface reconstruction, confidence measure). (c) Local path planning: an inspection path for the target surfaces is computed and optimized (information gain map, trajectory optimization) to improve the performance of MVS.

We employ the surfel deformation method of [39] for the surface model. Instead of using a deformation graph [40], the method individually transforms the position of each surfel to preserve global consistency with the SLAM module. After surfel deformation, we re-initialize the volumetric map to an unknown state and determine the occupied volumes directly from the deformed surfel positions. We then update the free space by casting rays from each updated pose to the occupied volumes. Volumes that are already assigned to the occupied state are excluded when the free space is updated. The ray-casting is performed at twice the coarseness of the resolution of M for a fast update.
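A rough sketch of this map re-initialization is given below, using a simple dictionary-based voxel set in place of the octree map M and stepping rays at twice the voxel size to approximate the coarser ray-casting; all function and parameter names are assumptions of the sketch.

```python
import numpy as np

def rebuild_volumetric_map(surfels, poses, voxel_size):
    """Re-initialize the occupancy map after a loop closure (illustrative sketch).

    surfels: (N, 3) deformed surfel positions; poses: list of camera centers from
    the updated pose graph. Voxels start unknown, are marked occupied from the
    surfels, and free space is carved by ray-casting with a coarse step.
    """
    occupied = {tuple(np.floor(p / voxel_size).astype(int)) for p in surfels}
    free = set()
    step = 2.0 * voxel_size  # coarser ray-casting for a fast update
    for cam in poses:
        for p in surfels:
            direction = p - cam
            dist = np.linalg.norm(direction)
            direction /= dist
            t = 0.0
            while t < dist:
                v = tuple(np.floor((cam + t * direction) / voxel_size).astype(int))
                if v not in occupied:      # occupied voxels are never overwritten
                    free.add(v)
                t += step
    return occupied, free
```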

V. PATH PLANNING METHOD

The aim of this study is to reconstruct a high-confidence surface model by exploring an unknown and bounded space as fast as possible. Algorithm 1 presents the pseudocode of the proposed path-planning method, which consists of three steps, as depicted in Fig. 4: global path planning, target surface extraction, and local path planning. At every iteration, the proposed method first determines a global path that maximizes the coverage of the unexplored region (Section V.A). Then, it plans a local path that completes the reconstruction of the surface model while maximizing the performance of the online MVS. To compute a local path, the proposed method extracts low-confidence surfaces (Section V.B) and determines a set of view configurations to acquire reference images of each low-confidence surface (Section V.C). Finally, it computes an optimal path that maximizes the performance of MVS when traversing all the reference viewpoints (Section V.D).

A. Global Path Planning

Our method divides the entire frontier in M into a set of clusters Vfront (line 1) and computes the global coverage of the clusters. It explores the unknown region by following the coverage sequentially. We formulate the problem of global coverage path planning as a generalized traveling salesman problem (GTSP) [41]. Let q ∈ Q be a feasible view configuration in configuration space Q, and ξ : [0, 1] → Q be a path. For each frontier cluster Vi ∈ Vfront, we generate a set of view configuration samples Qi in which more than a certain percentage of the frontiers in Vi are visible (line 2).

Algorithm 1 Proposed path-planning algorithm
Input: Volumetric map M, surface model F, and current configuration q_curr
/* Global path planning */
 1: V_front ← FrontierClustering(M)
 2: {Q_1, ..., Q_N} ← GlobalSampling(V_front)
 3: {q_1, ..., q_N} ← SolveGTSP({Q_1, ..., Q_N}, q_curr)
 4: {q_NBV, ξ_global} ← GetPath(q_curr, q_1)
/* Local path planning */
 5: while q_curr ≠ q_NBV do
 6:   if TravelTime > θ_time then
 7:     X_target ← GetTargetSurfPoints(F)
 8:     {Q_1, ..., Q_N} ← LocalSampling(X_target, ξ_global)
 9:     {q_1, ..., q_N} ← SolveGTSP({Q_1, ..., Q_N}, q_curr, q_NBV)
10:     {Q_ref, ξ_local} ← GetPath(q_curr, {q_1, ..., q_N}, q_NBV)
11:     ξ*_local ← OptimizePath(X_target, Q_ref, ξ_local)
12:   end if
13:   MoveToward(ξ*_local)
14:   Update(M, F, q_curr)
15: end while

Given the set of sample sets {Q1, ..., QN}, the GTSP algorithm selects a representative point qi from each sample set Qi and obtains the shortest tour departing from the current view qcurr and visiting all the representative points qi (line 3). Paths are generated using an A* planner, which examines the minimum Euclidean distance among the visited points. Given the resulting coverage path, we select the first sample q1 as the next view configuration qNBV and compute the global path ξglobal to qNBV (line 4).
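To make the structure of this step concrete, the sketch below replaces the GTSP solver [41] with a simple greedy nearest-cluster heuristic; it is an illustrative stand-in, not the solver used in the paper, and the function and variable names are assumptions.

```python
import numpy as np

def plan_global_coverage(sample_sets, q_curr):
    """Greedy stand-in for the GTSP-based global coverage planning.

    sample_sets: list of arrays Q_i of candidate view configurations (one set per
    frontier cluster), each row a 3D position (yaw omitted for brevity).
    Returns one representative per cluster, ordered as a short tour from q_curr.
    """
    tour, current = [], np.asarray(q_curr, dtype=float)
    remaining = [np.asarray(Q, dtype=float) for Q in sample_sets]
    while remaining:
        # For each unvisited cluster, find its closest sample to the current view.
        best = min(
            ((i, Q[np.argmin(np.linalg.norm(Q - current, axis=1))])
             for i, Q in enumerate(remaining)),
            key=lambda item: np.linalg.norm(item[1] - current),
        )
        idx, q_rep = best
        tour.append(q_rep)
        current = q_rep
        remaining.pop(idx)
    return tour  # tour[0] plays the role of q_NBV; the path to it is the global path
```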

B. Target Surface Extraction

Our method first reconstructs a tentative surface model from the surfels in F using the screened Poisson reconstruction algorithm [42]. We represent the reconstructed surfaces of this tentative model as a set of surface points X, where each point x ∈ X contains 3D position and normal values. The method then groups adjacent surface points by Euclidean clustering. Let Xj ⊂ X be a set of clustered surface points and xj be the averaged surface point of Xj. We define the confidence of xj as the average weight wj of the neighboring surfels in F. Finally, the averaged surface points whose confidence values are lower than θthr-conf are determined as the target surface points Xtarget (line 7).
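The extraction step can be sketched as follows; the clustering itself, the neighborhood radius, and the numeric value of the confidence threshold (θthr-conf in the paper) are assumptions of this illustration.

```python
import numpy as np

def extract_target_surfaces(surface_points, surfel_positions, surfel_weights,
                            cluster_labels, theta_conf=0.5, knn_radius=0.2):
    """Pick low-confidence surface clusters as target surfaces (sketch).

    surface_points: (N, 3) points sampled from the Poisson-reconstructed surface;
    cluster_labels: (N,) Euclidean-cluster index per point (clustering not shown).
    The confidence of a cluster is the mean weight of surfels near its averaged
    point; theta_conf and knn_radius are assumed values.
    """
    targets = []
    for c in np.unique(cluster_labels):
        x_mean = surface_points[cluster_labels == c].mean(axis=0)
        near = np.linalg.norm(surfel_positions - x_mean, axis=1) < knn_radius
        conf = surfel_weights[near].mean() if near.any() else 0.0
        if conf < theta_conf:
            targets.append(x_mean)
    return np.asarray(targets)
```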

C. Local Inspection Path Planning

This section describes a method to compute an inspection path that provides coverage of the target surfaces.


For each target surface point xj, the method generates a set of view configurations Qj that observe the target point xj by inversely composing a view frustum from xj along its normal direction [23] (line 8). We reject a sample configuration qj if the distance of the path via qj is γ times (1.3 in this paper) larger than the distance of the direct path from qcurr to qNBV. Similar to Section V.A, the GTSP algorithm is used to determine a tour that starts from qcurr, visits exactly one sample qj per sample set Qj, and ends at qNBV (line 9). The local path ξlocal is computed by sequentially connecting the selected samples (line 10). We refer to the selected tour set {q1, ..., qN} as the reference configuration set Qref, which provides the reference views for each target point.
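The rejection criterion amounts to a simple detour test, sketched below with hypothetical argument names (γ = 1.3 follows the paper):

```python
def accept_view_sample(dist_via_sample, dist_direct, gamma=1.3):
    """Detour test used when sampling local view configurations.

    dist_via_sample: path length q_curr -> q_j -> q_NBV through the candidate view;
    dist_direct:     length of the direct path q_curr -> q_NBV.
    A candidate is rejected if the detour exceeds gamma times the direct path.
    """
    return dist_via_sample <= gamma * dist_direct
```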

D. Trajectory Optimization

After determining a local inspection path for the target surfaces, our method refines the path to maximize the MVS performance (line 11). The following subsections introduce stereo-pair heuristics for predicting the reconstruction quality of MVS and describe how to apply them to the trajectory optimization problem.

1) Multi-view Stereo Heuristics: Given a stereo pair of a reference view configuration qref and a source view configuration qsrc, the reconstruction quality of a target surface x is related to geometric factors such as parallax, relative distance, and focus.

Parallax: There is a trade-off between triangulation accuracy and matchability according to the parallax of a stereo pair. Let α be the parallax angle of a stereo pair. We describe the informativeness of α for reconstructing the correct surface as a score function [10]:

$$f_{prx}(\alpha) = \exp\left(-\frac{(\alpha - \alpha_0)^2}{2\sigma_{prx}^2}\right) \qquad (2)$$

where α0 is the desired parallax angle, which is heuristically set to 15°, and σprx is a constant value.

Relative distance: The image patches from the reference and source images must have a similar resolution for accurate stereo matching [2]. We assume that views at the same distance to a surface observe the surface at the same resolution. A score function regarding the relative distance is defined as

$$f_{rd}(dist_{src}, dist_{ref}) = \frac{\min(dist_{src}, dist_{ref})}{\max(dist_{src}, dist_{ref})} \qquad (3)$$

where distsrc and distref denote the distance between x and qsrc and the distance between x and qref, respectively.

Focus: The surface region should preferably project near the principal point of the source image to reduce the reprojection error in triangulation [29] [30]. Let rco and rcx be the rays from the camera center c of a source image to the principal point o and to the surface point x, respectively. We define a penalizing function for a large angle β between the rays rco and rcx as

$$f_{foc}(\beta) = \exp\left(-\frac{\beta^2}{2\sigma_{foc}^2}\right) \qquad (4)$$

where σfoc is a constant value.

Integration: We integrate the heuristics into a score function that predicts the reconstruction quality of MVS:

$$f_{src}(q_{src}, q_{ref}, x) = f_{vis} \cdot f_{prx} \cdot f_{rd} \cdot f_{foc} \qquad (5)$$

where fvis is the visibility function that returns 1 if x is visible from qsrc, and 0 otherwise.
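The combined heuristic of Eqs. 2-5 can be evaluated as in the sketch below. The values of σprx and σfoc are placeholders (the paper only states that they are constants), and the visibility test is assumed to be computed elsewhere, e.g., by ray-casting in the volumetric map.

```python
import numpy as np

ALPHA_0 = np.deg2rad(15.0)       # desired parallax angle (15 degrees in the paper)
SIGMA_PRX = np.deg2rad(10.0)     # assumed spread values; the paper only states
SIGMA_FOC = np.deg2rad(20.0)     # that sigma_prx and sigma_foc are constants

def stereo_pair_score(q_src, q_ref, x, principal_ray, visible):
    """Heuristic MVS score f_src = f_vis * f_prx * f_rd * f_foc (Eqs. 2-5).

    q_src, q_ref, x: 3D camera centers and target surface point (numpy arrays);
    principal_ray: unit viewing direction of the source camera;
    visible: result of a visibility test for x from q_src (computed elsewhere).
    """
    if not visible:
        return 0.0
    v_src, v_ref = x - q_src, x - q_ref
    d_src, d_ref = np.linalg.norm(v_src), np.linalg.norm(v_ref)
    # Parallax term (Eq. 2): angle between the two viewing rays at x.
    alpha = np.arccos(np.clip(np.dot(-v_src / d_src, -v_ref / d_ref), -1.0, 1.0))
    f_prx = np.exp(-((alpha - ALPHA_0) ** 2) / (2.0 * SIGMA_PRX ** 2))
    # Relative-distance term (Eq. 3): penalize resolution mismatch.
    f_rd = min(d_src, d_ref) / max(d_src, d_ref)
    # Focus term (Eq. 4): angle between the principal ray and the ray to x.
    beta = np.arccos(np.clip(np.dot(principal_ray, v_src / d_src), -1.0, 1.0))
    f_foc = np.exp(-(beta ** 2) / (2.0 * SIGMA_FOC ** 2))
    return f_prx * f_rd * f_foc
```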

2) Informative Path Planning: A set of disjoint path segments ξ = {ξ1, ..., ξN} is defined based on the reference configurations in Qref. Each segment ξs is a path connecting consecutive reference view configurations. Let IG(ξs) be a function that represents the information gathered along ξs, and let TIME(ξs) be the corresponding travel time. The optimal path is determined by solving the following problem:

$$\xi^* = \arg\max_{\xi} \sum_{\xi_s \in \xi} \frac{IG(\xi_s)}{TIME(\xi_s)}, \quad \text{s.t. } TIME(\xi_s) \le B_s \text{ for every segment } s \qquad (6)$$

where $B_s$ is the time budget of segment s. We define the budget as $B_s = \gamma' \times TIME(\bar{\xi}_s)$, where $\gamma'$ is a constant value and $\bar{\xi}_s$ is the shortest path from the starting point to the end point of $\xi_s$. An information gain function is defined as

$$IG(\xi_s) = \sum_{q_i \in \xi_s} \sum_{q_r \in Q_{ref}} f_{src}(q_i, q_r, x_r) \qquad (7)$$

where qi is a discrete configuration along ξs, qr is a reference view configuration, and xr is the target surface of qr. Eq. 6 is an informative path-planning problem that can be solved as an optimization problem. To solve it, we employ the local optimization step in [43], which uses the covariance matrix adaptation evolution strategy [44]. The strategy is based on a Monte Carlo method; it uses repeated random sampling to estimate the solution.
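A minimal sketch of Eq. 7 and of a sampling-based stand-in for the CMA-ES optimization step [43] [44] is given below. The callables sample_segment, travel_time, ig_fn, and score_fn are assumptions of the sketch (e.g., score_fn could be a wrapper around stereo_pair_score above), not interfaces defined in the paper.

```python
import numpy as np

def segment_information_gain(segment_views, ref_views_and_targets, score_fn):
    """IG(xi_s) from Eq. 7: sum of stereo-pair scores along one path segment.

    segment_views: discrete configurations q_i sampled along the segment;
    ref_views_and_targets: list of (q_r, x_r) reference views and their targets;
    score_fn(q_src, q_ref, x): a stereo-pair scoring function.
    """
    return sum(score_fn(q_i, q_r, x_r) for q_i in segment_views
               for (q_r, x_r) in ref_views_and_targets)

def optimize_segment(sample_segment, travel_time, ig_fn, budget, n_samples=200):
    """Monte Carlo stand-in for the CMA-ES local optimization step (sketch).

    sample_segment(): returns a random perturbation of one segment (list of views);
    travel_time(seg) and ig_fn(seg) evaluate Eq. 6's objective IG/TIME under the
    per-segment time budget B_s.
    """
    best_seg, best_val = None, -np.inf
    for _ in range(n_samples):
        seg = sample_segment()
        t = travel_time(seg)
        if t > budget:
            continue                      # enforce TIME(xi_s) <= B_s
        val = ig_fn(seg) / t
        if val > best_val:
            best_seg, best_val = seg, val
    return best_seg
```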

E. Reference and Source Image Selection

Our method determines reference images every time a keyframe is extracted. Keyframes are extracted consistently at a regular frame interval. Given a reference image, N keyframes (10 in this paper) are selected as the source image set by evaluating the score function of Eq. 5 over the corresponding path segment ξs. The qNBV at the end of ξ*local does not have a target point; thus, it uses the N neighboring keyframes that share the most map-point observations as source images.
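A minimal sketch of this selection step is given below; the function and argument names are assumptions of the sketch, and only N = 10 follows the paper.

```python
def select_source_images(reference, keyframes, targets, score_fn, n_views=10):
    """Pick the N best keyframes as source images for one reference image.

    reference/keyframes: view configurations stored with the keyframe database;
    targets: target surface point(s) associated with the reference view;
    score_fn(q_src, q_ref, x): stereo-pair score of Eq. 5.
    """
    scored = []
    for kf in keyframes:
        score = sum(score_fn(kf, reference, x) for x in targets)
        scored.append((score, kf))
    scored.sort(key=lambda s: s[0], reverse=True)     # highest-scoring pairs first
    return [kf for _, kf in scored[:n_views]]
```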

VI. EXPERIMENTAL RESULTS

In this section, we describe the simulation experiments conducted to evaluate the performance of the proposed method. A Firefly hexacopter MAV was used as the mobile robot in the RotorS simulation environment [45]. The forward-looking stereo camera mounted on the MAV had a pitch angle of 5° downward and a field of view of [60°, 90°]. It captured images with a resolution of 752×480 px. We used a stereo version of ORB-SLAM [46] to obtain a metric scale of the motion and map points. For reliable pose estimation, we restricted the maximum translational speed to 0.5 m/s and the rotational speed to 0.25 m/s, and covered the ground with a textured scene. We used only the left images of the stereo camera for the MVS computation.


Fig. 5. Experimental results of scenario 1 (upper) and scenario 2 (lower). Reconstructed 3D models and volumetric maps with trajectories taken by the MAV at the end of the executions of (b, g) the circular trajectory, (c, h) NBV [5], (d, i) Sub-Cov [9], and (e, j) our method.

TABLE I
EXPERIMENTAL RESULTS OF TWO SCENARIOS.

Scenario 1
  Method     Time (min)   Path Length (m)   Error (m)   Comp. 0.05 m (%)   Comp. 0.10 m (%)   Comp. 0.15 m (%)
  Circular   12.68        239.6             0.1596      29.33              52.77              66.64
  NBV        29.60        643.8             0.0987      53.06              71.97              82.19
  Sub-Cov    28.72        654.5             0.0805      57.45              73.98              82.71
  Ours       18.37        383.1             0.0746      56.85              72.35              84.46

Scenario 2
  Method     Time (min)   Path Length (m)   Error (m)   Comp. 0.05 m (%)   Comp. 0.10 m (%)   Comp. 0.15 m (%)
  Circular   15.60        290.3             0.1560      30.10              46.20              59.70
  NBV        43.12        891.6             0.1273      48.38              65.77              77.67
  Sub-Cov    46.65        1016.5            0.0961      52.24              69.60              82.92
  Ours       29.64        498.2             0.0736      62.51              80.07              86.75

There are two simulation scenarios [47]: a structure with highly textured surfaces (Catholic Church: 36×28×30 m³) and a structure with relatively less-textured surfaces (Colorado State Capitol: 50×46×30 m³). The proposed method was compared with the NBV [5] and submodular coverage (Sub-Cov) [9] methods. These methods use the explore-then-exploit strategy; after constructing an initial coarse model from a default trajectory, they compute the coverage path [9] or NBVs [5] for the coarse model. The NBV method iteratively determines the best viewpoint online from a partial reconstruction. The Sub-Cov method computes the coverage path of the initial model offline. Both the initial and final models were constructed using our online modeling system. We performed an initial scan along a circular trajectory around the target space with a camera pitch of 20°. We tuned the travel budget of Sub-Cov to obtain the best modeling performance in each scenario. Every reconstructed surface model was post-processed using Poisson surface reconstruction.

The performance of the proposed method, compared to the other methods, was evaluated from two different perspectives: path efficiency and modeling quality. To evaluate the path efficiency, we computed the completion time and total path length. The modeling quality refers to the accuracy and completeness of the surface model [48]. The accuracy was estimated by calculating the mean errors. The completeness was defined as the percentage of surface points whose distance error is smaller than a threshold. Fig. 5 depicts the paths and the corresponding models of the best trial. Table I presents the average results of five executions. The results of NBV and Sub-Cov include the cumulative time and path length of the whole explore-then-exploit process.
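For reference, the two quality metrics can be computed as in the following sketch, assuming point-cloud inputs and SciPy's cKDTree; the exact evaluation protocol of [48] differs in details.

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy_and_completeness(recon_pts, gt_pts, thresholds=(0.05, 0.10, 0.15)):
    """Accuracy and completeness metrics in the spirit of Table I.

    Accuracy: mean distance from reconstructed points to the ground-truth surface.
    Completeness: percentage of ground-truth surface points whose distance to the
    reconstruction is below each threshold (0.05 / 0.10 / 0.15 m).
    """
    gt_tree = cKDTree(gt_pts)
    rec_tree = cKDTree(recon_pts)
    acc = gt_tree.query(recon_pts)[0].mean()                 # error (m)
    d_gt = rec_tree.query(gt_pts)[0]
    comp = {t: 100.0 * float((d_gt < t).mean()) for t in thresholds}
    return acc, comp
```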

Our method performed best in terms of path efficiency. It provided an efficient coverage path that rarely overlapped itself while requiring less travel time and no initial scan. Moreover, even with these short trajectories, our method achieved the best modeling performance in terms of accuracy and completeness. Particularly in Scenario 2, our method outperformed the others in overall performance. The modeling system does not guarantee a complete and accurate depth estimate of a reference image; therefore, several factors such as reconstruction quality and MVS heuristics should be considered online. However, NBV and Sub-Cov do not consider these factors during path planning. In particular, NBV focuses only on the viewpoint that covers the largest surface area while disregarding minor surfaces, so its reconstructed models can be incomplete. Our method, on the other hand, focuses on completing the insufficiently reconstructed regions by examining the completeness of the reconstructed surfaces. It also optimizes the path to obtain the best reference and source images; these approaches enhance the modeling performance.

VII. CONCLUSION

We proposed an online MVS algorithm for view path planning around a large-scale structure. The proposed method computes view paths using online feedback based on the quality of the reconstructed surface. The computed paths provide full coverage of low-quality surfaces while maximizing the reconstruction performance of MVS. To the best of our knowledge, this is the first work that applies an exploration-planning approach to model unknown structures using an online MVS system.


REFERENCES

[1] B. Hepp, M. Nießner, and O. Hilliges, "Plan3d: Viewpoint and trajectory optimization for aerial multi-view stereo reconstruction," ACM Transactions on Graphics, vol. 38, no. 1, p. 4, 2018.
[2] J. L. Schonberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, "Pixelwise view selection for unstructured multi-view stereo," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 501-518.
[3] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multiview stereopsis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1362-1376, 2010.
[4] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, "A multi-view stereo benchmark with high-resolution images and multi-camera videos," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3260-3269.
[5] R. Huang, D. Zou, R. Vaughan, and P. Tan, "Active image-based modeling with a toy drone," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1-8.
[6] A. Bircher, M. Kamel, K. Alexis, M. Burri, P. Oettershagen, S. Omari, T. Mantel, and R. Siegwart, "Three-dimensional coverage path planning via viewpoint resampling and tour optimization for aerial robots," Autonomous Robots, vol. 40, no. 6, pp. 1059-1078, 2016.
[7] W. Jing, J. Polden, W. Lin, and K. Shimada, "Sampling-based view planning for 3d visual coverage task with unmanned aerial vehicle," in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 1808-1815.
[8] M. D. Kaba, M. G. Uzunbas, and S. N. Lim, "A reinforcement learning approach to the view planning problem," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 5094-5102.
[9] M. Roberts, D. Dey, A. Truong, S. Sinha, S. Shah, A. Kapoor, P. Hanrahan, and N. Joshi, "Submodular trajectory optimization for aerial 3d scanning," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5324-5333.
[10] N. Smith, N. Moehrle, M. Goesele, and W. Heidrich, "Aerial path planning for urban scene reconstruction: a continuous optimization method and benchmark," in SIGGRAPH Asia 2018 Technical Papers. ACM, 2018, p. 183.
[11] M. Pizzoli, C. Forster, and D. Scaramuzza, "Remode: Probabilistic, monocular dense reconstruction in real time," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 2609-2616.
[12] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "Dtam: Dense tracking and mapping in real-time," in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320-2327.
[13] K. Wang, W. Ding, and S. Shen, "Quadtree-accelerated real-time monocular dense mapping," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1-9.
[14] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, "Elasticfusion: Real-time dense slam and light source estimation," The International Journal of Robotics Research, vol. 35, no. 14, pp. 1697-1716, 2016.
[15] J. Aloimonos, I. Weiss, and A. Bandyopadhyay, "Active vision," International Journal of Computer Vision, vol. 1, no. 4, pp. 333-356, 1988.
[16] R. Bajcsy, Y. Aloimonos, and J. K. Tsotsos, "Revisiting active perception," Autonomous Robots, vol. 42, no. 2, pp. 177-196, 2018.
[17] N. J. Sanket, C. D. Singh, K. Ganguly, C. Fermuller, and Y. Aloimonos, "Gapflyt: Active vision based minimalist structure-less gap detection for quadrotor flight," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 2799-2806, 2018.
[18] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon, "Kinectfusion: Real-time dense surface mapping and tracking," in International Symposium on Mixed and Augmented Reality (ISMAR), vol. 11, no. 2011, 2011, pp. 127-136.
[19] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "Octomap: An efficient probabilistic 3d mapping framework based on octrees," Autonomous Robots, vol. 34, no. 3, pp. 189-206, 2013.
[20] T. Cieslewski, E. Kaufmann, and D. Scaramuzza, "Rapid exploration with multi-rotors: A frontier selection method for high speed flight," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 2135-2142.
[21] Z. Meng, H. Qin, Z. Chen, X. Chen, H. Sun, F. Lin, and M. H. Ang Jr, "A two-stage optimized next-view planning framework for 3-d unknown environment exploration, and structural reconstruction," IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1680-1687, 2017.
[22] S. Song and S. Jo, "Online inspection path planning for autonomous 3d modeling using a micro-aerial vehicle," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 6217-6224.
[23] ——, "Surface-based exploration for autonomous 3d modeling," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1-8.
[24] C. Hoppe, M. Klopschitz, M. Rumpler, A. Wendel, S. Kluckner, H. Bischof, and G. Reitmayr, "Online feedback for structure-from-motion image acquisition," in British Machine Vision Conference (BMVC), vol. 2, 2012, p. 6.
[25] S. Haner and A. Heyden, "Covariance propagation and next best view planning for 3d reconstruction," in European Conference on Computer Vision (ECCV). Springer, 2012, pp. 545-556.
[26] "Pix4d. pix4dcapture," 2017. [Online]. Available: http://pix4d.com/product/pix4dcapture
[27] A. P. U. Manual, "Professional edition," Aplastic Anemia (Hypoplastic Anemia), 2014.
[28] C. Hoppe, A. Wendel, S. Zollmann, K. Pirker, A. Irschara, H. Bischof, and S. Kluckner, "Photogrammetric camera network design for micro aerial vehicles," in Computer Vision Winter Workshop (CVWW), vol. 8, 2012, pp. 1-3.
[29] O. Mendez Maldonado, S. Hadfield, N. Pugeault, and R. Bowden, "Next-best stereo: extending next best view optimisation for collaborative sensors," in British Machine Vision Conference (BMVC), 2016.
[30] O. Mendez, S. Hadfield, N. Pugeault, and R. Bowden, "Taking the scenic route to 3d: Optimising reconstruction from moving cameras," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4677-4685.
[31] C. Forster, M. Pizzoli, and D. Scaramuzza, "Appearance-based active, monocular, dense reconstruction for micro aerial vehicle," in 2014 Robotics: Science and Systems Conference, no. EPFL-CONF-203672, 2014.
[32] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, "Deepflow: Large displacement optical flow with deep matching," in 2013 IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1385-1392.
[33] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "Orb-slam: a versatile and accurate monocular slam system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015.
[34] J. Engel, J. Sturm, and D. Cremers, "Semi-dense visual odometry for a monocular camera," in 2013 IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1449-1456.
[35] D. Min, S. Choi, J. Lu, B. Ham, K. Sohn, and M. N. Do, "Fast global image smoothing based on weighted least squares," IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5638-5653, 2014.
[36] D. Gallup, J.-M. Frahm, and M. Pollefeys, "Piecewise planar and non-planar stereo for urban scene reconstruction," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 1418-1425.
[37] J. R. Shewchuk, "Delaunay refinement algorithms for triangular mesh generation," Computational Geometry, vol. 22, no. 1-3, pp. 21-74, 2002.
[38] M. Bleyer, C. Rhemann, and C. Rother, "Patchmatch stereo - stereo matching with slanted support windows," in British Machine Vision Conference (BMVC), vol. 11, 2011, pp. 1-11.
[39] K. Wang, F. Gao, and S. Shen, "Real-time scalable dense surfel mapping," in 2019 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6919-6925.
[40] R. W. Sumner, J. Schmid, and M. Pauly, "Embedded deformation for shape manipulation," ACM Transactions on Graphics, vol. 26, no. 3, p. 80, 2007.
[41] B. Hu and G. R. Raidl, "Effective neighborhood structures for the generalized traveling salesman problem," in European Conference on Evolutionary Computation in Combinatorial Optimization. Springer, 2008, pp. 36-47.


[42] M. Kazhdan and H. Hoppe, "Screened poisson surface reconstruction," ACM Transactions on Graphics, vol. 32, no. 3, p. 29, 2013.
[43] M. Popovic, G. Hitz, J. Nieto, I. Sa, R. Siegwart, and E. Galceran, "Online informative path planning for active classification using uavs," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5753-5758.
[44] N. Hansen, "The cma evolution strategy: a comparing review," in Towards a New Evolutionary Computation. Springer, 2006, pp. 75-102.
[45] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, "Rotors - a modular gazebo mav simulator framework," in Robot Operating System (ROS). Springer, 2016, pp. 595-625.
[46] R. Mur-Artal and J. D. Tardos, "Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255-1262, 2017.
[47] "Google's 3d warehouse." [Online]. Available: http://3dwarehouse.sketchup.com/
[48] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, "Tanks and temples: Benchmarking large-scale scene reconstruction," ACM Transactions on Graphics, vol. 36, no. 4, p. 78, 2017.
