


1077-2626 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Manuscript received 10 Sept. 2019; accepted 5 Feb. 2020. Date of publication 18 Feb. 2020; date of current version 27 Mar. 2020. Digital Object Identifier no. 10.1109/TVCG.2020.2973477

Live Semantic 3D Perception for Immersive Augmented Reality

Lei Han, Tian Zheng, Yinheng Zhu, Lan Xu, and Lu Fang

[Fig. 1 pipeline panels: Tablet + RGB-D Cam → Geometry Mesh → Semantic Labels → Physics Engine / AR rendering]

Fig. 1: Overview of our live semantic 3D perception system and the augmented reality application built upon it. We take input from an RGBD camera, followed by a 3D geometric reconstruction and a 3D semantic segmentation process, producing semantic labels for each voxel. To demonstrate the efficiency, we further build an interactive AR application using a physics engine, where the user can shoot virtual balls into the scene. Each ball's trajectory follows a realistic bounce determined by the semantic labels of the different surfaces it hits. The whole system runs online on a portable laptop.

Abstract—Semantic understanding of 3D environments is critical for both unmanned systems and human-involved virtual/augmented reality (VR/AR) immersive experiences. Spatially-sparse convolution, which takes advantage of the intrinsic sparsity of 3D point cloud data, makes high-resolution 3D convolutional neural networks tractable, with state-of-the-art results on 3D semantic segmentation problems. However, the exhaustive computation limits the practical use of semantic 3D perception for VR/AR applications on portable devices. In this paper, we identify that the efficiency bottleneck lies in the unorganized memory access of the sparse convolution steps, i.e., the points are stored independently based on a predefined dictionary, which is inefficient given the limited memory bandwidth of parallel computing devices (GPUs). With the insight that points lie on continuous 2D surfaces in 3D space, a chunk-based sparse convolution scheme is proposed to reuse the neighboring points within each spatially organized chunk. An efficient multi-layer adaptive fusion module is further proposed to exploit the spatial consistency cue of 3D data and further reduce the computational burden. Quantitative experiments on public datasets demonstrate that our approach runs 11× faster than previous approaches with competitive accuracy. By implementing both semantic and geometric 3D reconstruction simultaneously on a portable tablet device, we demonstrate a foundation platform for immersive AR applications.

Index Terms—Dense 3D Reconstruction, 3D Semantic Segmentation, 3D Convolutional Network, Virtual Reality, Augmented Reality

1 INTRODUCTION

3D perception of real-world environments serves as a fundamental technique for applications such as mobile robot navigation, VR/AR-related interactions, etc. With the popularity of RGBD sensors, various approaches [12, 15, 30, 46] have demonstrated significant potential for live 3D geometric reconstruction of the environment. However, a purely geometric model can hardly satisfy the increasing demand for semantic-level understanding of 3D scenes, especially for boosting the intelligence of mobile robots. As a result, how to realize joint geometric and semantic 3D reconstruction in real time has recently attracted substantial attention from both the computer vision and computer graphics communities.

• Lei Han, Tian Zheng, Yinheng Zhu, Lan Xu and Lu Fang are with Tsinghua University. E-mail: [email protected]. This work was done at Tsinghua University.

• Lei Han and Lan Xu are also with Hong Kong University of Science and Technology. E-mail: [email protected], [email protected].

• Corresponding author: Lu Fang

With the rapid development of deep learning techniques for image understanding tasks [7, 17, 23], a straightforward solution for semantic 3D perception is to utilize existing 2D image segmentation techniques and project the semantic labels from 2D pixels into 3D space [27, 29, 40]. The problem with such image-based 3D segmentation techniques is that each image only provides an observation of the environment from a single view. The inconsistent semantic observations across consecutive 2D images must be further filtered using Bayesian estimation [27] or conditional random fields [29]. Approaches that directly process the point clouds in 3D space [8, 14] therefore appear to be more effective. Yet 3D representations are highly unstructured and cannot be trivially processed using conventional convolutional neural networks (CNNs).

The straightforward extension of convolution to 3D space by reshaping the weight kernels from 2D to 3D is impractical, because the number of hidden states in a 3D convolution is proportional to $N^3$, where $N$ is the resolution along each dimension. Recent works on 3D convolution are either point-based or volumetric-based. Point-based methods that directly consume unordered point clouds have attracted a lot of attention since the pioneering work of PointNet [34], owing to their high efficiency. However, quantitative experiments on public datasets [3, 10] have demonstrated that point-based methods suffer from much lower accuracy than volumetric 3D convolution based methods. On the other hand, volumetric methods [8, 13, 14, 37] exploit the sparsity of points in the 3D volumetric space and consider only those voxels that contain surface. They achieve high accuracy at the cost of real-time performance.

In this paper, aiming to maintain the high accuracy of volumetric methods while significantly reducing the computational complexity, we present an efficient high-resolution 3D convolution scheme that enables online 3D semantic reconstruction on a portable device. It runs 11× faster than previous approaches with competitive accuracy. Technically, given the basic modules of FlashFusion [15] for geometric reconstruction and Sparse Convolutional Networks (SCN) [13] for 3D convolution in volumetric space, the demand for online computation on a portable device imposes huge challenges on the algorithm design.

We start by identifying that the efficiency bottleneck of sparse 3D CNNs lies in the unorganized memory access of the sparse convolution steps. In other words, the points are stored independently based on a predefined dictionary, which is inefficient given the limited memory bandwidth of parallel computing devices (GPUs). More specifically, previous SCN-based methods [8, 14] assume that the spatial locations of point clouds are independent and identically distributed random variables, and thus spatially sparse convolution is applied to each active voxel independently using a rulebook-based technique [14]. Spatially sparse convolution is then carried out on the $N^D$ spatial neighborhood of each voxel, as shown in Eqn. 1:

$$x^{\mathrm{out}}_u = \sum_{i \in \mathcal{N}^D} W_i \, x^{\mathrm{in}}_{u+i} \quad \text{for } u \in \mathcal{A}^{\mathrm{out}}, \qquad (1)$$

where $N$ is a pre-defined kernel size, $D$ indicates the dimension of the spatial space (equal to 3 for 3D convolution), and $W_i$ is the weight matrix at location $i$ for the input feature $x^{\mathrm{in}}_{u+i}$. $\mathcal{A}$ denotes the set of non-empty locations. Unlike dense CNNs, where input features are stored in order according to their spatial locations, the addresses of the input features $x^{\mathrm{in}}_{u+i}$ in SCN are defined by spatial hashing [41], which maps spatial locations randomly to non-repeated integers. Since all the input data are stored in disorder in GPU global memory, the convolution process in SCN requires a large amount of random global memory access, which is the main bottleneck of the run-time performance of the existing SCN methods [8, 14].
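To make Eqn. 1 and the memory-access issue concrete, the following minimal NumPy sketch evaluates the sparse convolution by looking each neighbor up in a Python dictionary, which stands in for the GPU hash table; every features.get(v) call is a scattered lookup, which is exactly the random global memory access discussed above. The function name, the dictionary-based layout, and the assumption that A_out = A_in are ours for illustration and do not describe the exact data structures of [8, 14].

import numpy as np

def sparse_conv_hashmap(features, weights, offsets):
    # features: dict mapping voxel coordinate (x, y, z) -> input feature of shape (C_in,)
    # weights:  dict mapping kernel offset i -> weight matrix W_i of shape (C_in, C_out)
    # offsets:  list of kernel offsets, e.g. all (dx, dy, dz) in {-1, 0, 1}^3
    c_out = next(iter(weights.values())).shape[1]
    out = {}
    for u in features:                       # every active (non-empty) voxel
        acc = np.zeros(c_out)
        for i in offsets:                    # N^D neighborhood of u
            v = (u[0] + i[0], u[1] + i[1], u[2] + i[2])
            x_in = features.get(v)           # hash lookup -> scattered memory access
            if x_in is not None:
                acc += x_in @ weights[i]     # W_i x_in_{u+i}
        out[u] = acc
    return out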

The above analysis reveals the insight that the spatial locations of point clouds are evenly distributed on continuous object surfaces and are highly correlated in 3D space. Building on this, we propose a chunk-based sparse convolution scheme. It reduces the number of global memory accesses by sharing neighboring features inside each chunk for sparse convolution, which achieves a 4× overall speed-up on typical GPU devices. The features inside each chunk are read from slow global memory only once and are cached in fast local memory for later use. Typically, the larger the chunk, the more efficient the memory usage, but the more local memory is required. Due to the limited size of local memory on GPU devices, we adaptively choose the size of each chunk based on the sparsity of the input point cloud at the chunk's location. To further reduce the computational burden, an efficient multi-layer adaptive fusion module is proposed to exploit the spatial consistency cue of 3D data. In summary, the contributions of this paper include:

• We propose an adaptive chunk-based 3D convolution strategy that reduces random global memory access on GPU devices in the sparse convolution steps, achieving a 4× speed-up compared to conventional methods [8, 14] with the same network structure and without sacrificing accuracy.

• We propose a novel spatial-attention-based multi-resolution feature fusion layer, which establishes new state-of-the-art performance on the S3DIS [4] dataset with a mIoU score of 68.3%, surpassing the previous state-of-the-art method [8] by a margin of 2.9%. A simplified 3D convolutional network structure is also presented that achieves performance similar to [8] while being 11× faster.

• We present a real-time simultaneous 3D reconstruction and semantic segmentation system that runs on mobile devices. The effectiveness of our system is demonstrated with a live immersive AR demo in which users interact with the environment based on its material information.

2 RELATED WORK

An overview of research related to our real-time 3D perception system is presented in this section. We first review existing systems for real-time joint geometric and semantic 3D perception. Our approach belongs to the category that directly predicts the semantic labels of the reconstructed 3D model, so a detailed analysis of algorithms for 3D semantic segmentation is presented in Sec. 2.2.

2.1 Geometric and Semantic 3D Perception

3D perception of the environment, including both geometric modeling and semantic understanding, is a fundamental technique for various applications including robot perception and mixed reality. Various approaches [27–29, 33, 40] pursue this goal by employing deep learning techniques for 2D image understanding and projecting the predicted image labels onto the reconstructed 3D model for 3D point cloud segmentation. One of the early works is [27], which employs ElasticFusion [44] for dense 3D reconstruction of the environment and predicts the per-class semantic probability of each pixel in the observed images using the Deconvolutional Semantic Segmentation network architecture [32]. ElasticFusion provides geometric constraints across images observed from different view angles: the same surfel is observed at different pixels in each image. The multiview observations are fused under a Bayesian framework. This pipeline is followed by later works with more robust geometric reconstruction modules and image-based semantic segmentation techniques. [29, 33] further demonstrate the effectiveness of conditional random fields in fusing multiple observations of the same voxel from different view angles for higher segmentation performance. The image-based convolution strategy is computationally efficient, yet it treats observations from different view angles independently and neglects the geometric information in the 3D model of the environment, leading to degraded performance on 3D understanding tasks [8, 29]. In this paper, we directly estimate the semantic label of each point of the reconstructed model in 3D space for robust semantic segmentation.
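As a rough illustration of the Bayesian multiview fusion used by [27] (a simplified sketch under an independent-observation assumption, not their exact update rule), the per-class probabilities predicted for the same surfel from different views can be fused by accumulating log-probabilities and renormalizing:

import numpy as np

def fuse_label_observations(prob_per_view):
    # prob_per_view: array of shape (num_views, num_classes); each row is the
    # per-class probability predicted for the same surfel from one view.
    log_posterior = np.sum(np.log(prob_per_view + 1e-12), axis=0)  # product of likelihoods
    log_posterior -= log_posterior.max()                           # numerical stability
    posterior = np.exp(log_posterior)
    return posterior / posterior.sum()

# Two views that partly disagree on a 3-class label:
print(fuse_label_observations(np.array([[0.7, 0.2, 0.1],
                                         [0.5, 0.4, 0.1]])))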

2.2 3D Semantic Segmentation

Unlike images, which are represented by densely organized pixels in 2D space, a 3D scene is normally represented by unordered point clouds, making it a hard problem to apply convolutional neural networks to 3D scene understanding. The approaches proposed to tackle this problem can be divided into three categories: point-based methods [20, 21, 34, 35, 43], which directly consume unordered point clouds as input and use a permutation-invariant neural network for feature extraction; multiview methods [11, 18, 38], which aggregate information from multiple 2D observations of the 3D environment taken from different view angles; and voxel-based methods [4, 14, 26, 45, 48], which divide the 3D space into voxels and apply 3D convolutional neural networks to voxels just as 2D convolutions are applied to pixels. In this paper, we focus on 3D semantic segmentation using voxel-based approaches, which have achieved high accuracy on various 3D scene understanding problems [4, 5, 10] yet suffer from a computational burden that makes real-time application difficult.

Unstructured point clouds can be assigned to voxels by voxelizing the 3D space, after which 3D convolutional neural networks can be applied for semantic segmentation, as pioneered by [26, 45]. However, given the limited computational power and memory of modern graphics processing units, conventional 3D convolutional neural networks can only work at a coarse voxel level, where most of the complexity lies in the empty voxels that contain no points. To address this problem, [37] proposes to use an octree data structure in which each node stores a pooled feature representation with varied spatial size, e.g., coarse resolution for empty space (inactive voxels) and fine resolution for voxels that contain points (active voxels). [13] employs hashing techniques for active voxels and omits inactive voxels in the convolution step, and further proposes Submanifold Sparse Convolution (SSC) [14] to avoid the "dilate" effect of convolution operations by restricting inactive voxels to remain inactive during the forward pass of the network. SSC achieves high performance with a bounded theoretical FLOP count, yet its run time is limited by the irregular sparse data representation. [36] proposes a tiled sparse convolution module and processes each block of data as a dense convolution, which can be handled efficiently by conventional dense convolution operations. However, it only supports 2D convolution, where pixels are densely organized in each small block, because the dense convolution operates on the inactive voxels as well. Recently, Minkowski Net [8] has been proposed as a framework for generalized sparse convolution, employing the sparse tensors of the PyTorch library for efficient convolution operations. In this paper, an adaptive chunk-based convolution strategy is proposed to further improve the efficiency of sparse 3D convolution. The efficiency gain comes from the observation that, despite being sparse, point clouds are continuously distributed on object surfaces in 3D space, so reusing the features of nearby voxels is possible based on the proposed chunk data structure for spatially sparse convolution operations.

3 METHODS

A real-time geometric and semantic 3D perception system is presented in this paper, as illustrated in Fig. 1. The reconstructed 3D model and semantic labels are sent to a physics engine (Unity3D in our experiments), which can then be used for immersive, human-involved interactions with the environment, as shown in Sec. 4.4. In particular, we demonstrate the effectiveness of the proposed system by throwing a virtual ball into the real-world scene. Given the geometric 3D model, the ball's trajectory is determined by its initial speed, gravity, and rebounds when it hits object surfaces. Given the semantic 3D model, we can set the damping factor of each collision according to the object hit, e.g., a small damping factor for the sofa and a large damping factor for the floor, since they are made of different materials as indicated by the semantic priors.
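As a hedged sketch of how the semantic labels drive the bounce (the mapping, class names, and numbers below are illustrative assumptions, not the exact configuration of our Unity3D demo), each class is mapped to a damping factor that scales the reflected normal velocity at a collision:

import numpy as np

# Hypothetical mapping from semantic labels to the collision damping factor,
# interpreted here as the fraction of normal velocity kept after a bounce.
DAMPING_BY_LABEL = {
    "sofa":  0.2,   # soft surface: keeps little normal velocity, barely bounces
    "floor": 0.8,   # hard surface: keeps most normal velocity, bounces high
    "table": 0.6,
    "wall":  0.5,
}

def bounce_velocity(v_in, surface_normal, label):
    # Reflect the incoming velocity about the surface normal and scale the
    # reflected normal component by the label's damping factor.
    n = np.asarray(surface_normal, dtype=float)
    n /= np.linalg.norm(n)
    v = np.asarray(v_in, dtype=float)
    factor = DAMPING_BY_LABEL.get(label, 0.5)      # default for unknown labels
    v_normal = np.dot(v, n) * n                    # component along the normal
    v_tangent = v - v_normal
    return v_tangent - factor * v_normal           # reflected, damped bounce

print(bounce_velocity([1.0, -4.0, 0.0], [0.0, 1.0, 0.0], "floor"))  # -> [1.0, 3.2, 0.0]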

FlashFusion [15] is employed for geometric modeling of the environment. It takes RGBD sequences as input, tracks camera poses for each frame in a front-end thread, and generates a globally consistent dense 3D model at keyframe rate in a back-end thread. 3D surfaces are represented implicitly as the zero level set of a signed distance function and extracted as triangle meshes using the Marching Cubes [24] algorithm. The signed distance function is defined on the volumetric space at a fixed resolution (e.g., 5 mm in our experiments). For large-scale dense 3D reconstruction, a spatial hashing [31] technique is adopted, where contiguous N × N × N voxels are organized as a chunk and each chunk is mapped to an address by the spatial hash function [41]. Only chunks that are close to object surfaces are stored, which significantly reduces the memory requirements compared with dense volumetric representations. FlashFusion employs FastGO [16] for globally consistent camera pose estimation and proposes a sparse voxel sampling technique for efficient signed distance function fusion, enabling efficient dense 3D reconstruction using CPU computing on portable devices.
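The chunked, hash-addressed volumetric layout can be pictured with the following minimal sketch, which uses assumed 8 × 8 × 8 chunks and a Python dictionary in place of the spatial hash table [31, 41]; FlashFusion's actual data structures and fusion rule differ:

import numpy as np

CHUNK_SIZE = 8          # assumed N for the N x N x N chunks (illustrative)
VOXEL_RES  = 0.005      # 5 mm voxel resolution, as in our experiments

class ChunkedTSDF:
    # Sparse signed-distance volume: chunks are allocated lazily in a dict
    # keyed by integer chunk coordinates (a stand-in for the spatial hash table).
    def __init__(self):
        self.chunks = {}

    def _split(self, point):
        voxel = np.floor(np.asarray(point) / VOXEL_RES).astype(int)
        chunk_key = tuple(voxel // CHUNK_SIZE)       # which chunk the voxel falls in
        local = tuple(voxel % CHUNK_SIZE)            # voxel index inside the chunk
        return chunk_key, local

    def update(self, point, sdf_value):
        chunk_key, local = self._split(point)
        chunk = self.chunks.setdefault(              # only near-surface chunks get allocated
            chunk_key, np.ones((CHUNK_SIZE,) * 3, dtype=np.float32))
        chunk[local] = sdf_value                     # real fusion would use weighted averaging

tsdf = ChunkedTSDF()
tsdf.update((0.1, 0.2, 0.3), -0.002)
print(len(tsdf.chunks), "chunk(s) allocated")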

For semantic understanding of the environment, we adopt volumetric 3D convolution for point cloud segmentation. The spatially-sparse convolutional network first proposed by [13, 14] is employed as the basic framework for the convolution operations. Although volumetric methods have achieved impressive performance on 3D semantic understanding problems, their complexity limits their use in real-time applications on portable devices. In Sec. 3.1, an adaptive chunk-based sparse convolution module is introduced, which enables our real-time semantic understanding on portable devices. The network structure used for 3D semantic segmentation in this work is described in Sec. 3.2; it achieves superior performance on public indoor datasets, as shown in the experiments in Sec. 4.

3.1 Adaptive chunk-based sparse convolution

To present our method clearly, we first give a brief background on how spatially sparse convolution works. We then explain our method in general, followed by implementation details.

3.1.1 Spatially sparse convolutional networks

Compared to the traditional 2D dense convolution widely used in image recognition and segmentation, direct 3D dense convolution is very costly and inefficient due to the large number of weights and the high-dimensional input space. Therefore, we have to take advantage of the intrinsic sparsity of 3D point cloud data to make the computation tractable. Among several approaches optimized for 3D convolution, we adopt spatially sparse convolution. As described in several sparse convolution frameworks [8, 14], sparse convolution discards empty locations, preserving and executing convolution only on the non-empty locations of the input data, thus saving a great deal of memory and computation.

To define the sparse convolution formally, let $x^{\mathrm{in}}_u \in \mathbb{R}^{C_{\mathrm{in}}}$ be the input feature located at $u \in \mathbb{R}^D$ (a $D$-dimensional coordinate) with $C_{\mathrm{in}}$ channels. We define the kernel weights as $W \in \mathbb{R}^{K^D \times C_{\mathrm{in}} \times C_{\mathrm{out}}}$ and slice $W$ along the first dimension, resulting in a list of matrices of size $C_{\mathrm{in}} \times C_{\mathrm{out}}$. The $i$-th slice of $W$ is denoted $W_i$, $i \in \{1, 2, \cdots, K^D\}$. A single convolution forward pass is formulated as:

$$x^{\mathrm{out}}_u = \sum_{i \in \mathcal{N}^D(K) \cap \{i \,|\, u+i \in \mathcal{A}^{\mathrm{in}}\}} W_i \, x^{\mathrm{in}}_{u+i} \quad \text{for } u \in \mathcal{A}^{\mathrm{out}}, \qquad (2)$$

where $\mathcal{N}^D(K)$ is the convolution receptive field defined by the kernel shape and centered at the origin, e.g., $\mathcal{N}^1(3) = \{-1, 0, 1\}$, and $\mathcal{A}$ denotes the set of non-empty locations. Consider the simplest case of a padded convolution with kernel stride 1, where the spatial size remains unchanged and $\mathcal{A}^{\mathrm{out}}$ is the same as $\mathcal{A}^{\mathrm{in}}$. Note that only non-empty locations are involved in the computation.

In the traditional approach, sparse convolution is conducted as in Alg. 1. First, we generate a rulebook R based on A_in, A_out, and the kernel shape. Each entry (I_j, O_j) in R_i indicates that the feature located at I_j should be multiplied by the i-th weight and then added to the feature located at O_j.

Algorithm 1 Spatially Sparse Convolution

Require: input features Fi, output feature placeholder Fo, kernel weights W, rulebook R; K^D is the kernel volume
1:  procedure GENERATE RULEBOOK
2:    for all O_i ∈ A_out do
3:      for all I_i ∈ A_in ∩ ReceptionField(O_i) do
4:        offset ← the position of I_i in the reception field
5:        append (I_i, O_i) to R_offset
6:  procedure SPARSE CONVOLUTION
7:    Fo ← 0
8:    for all i ∈ {0, 1, ..., K^D − 1} do
9:      for all (I_j, O_j) ∈ R_i do
10:       F_{O_j} ← F_{O_j} + W_i · F_{I_j}    // optionally add a bias

As explained above, the rulebook R records the memory addresses of the input/output features required in a convolution operation. A rulebook is needed because sparse data are stored contiguously but in no particular spatial order in memory; the rulebook makes explicit the relationship between spatial locations and memory addresses.

Intuitive as it seems, this approach has a major performance pitfall when it comes to GPU parallelization. Since all input data are stored in disorder in GPU global memory, the algorithm requires a large amount of random, un-coalesced global memory access, which severely limits the effective GPU memory bandwidth.
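For concreteness, the sketch below implements Alg. 1 in NumPy (the array layouts, names, and the assumption that A_out = A_in are ours for illustration, not the layout of any particular SCN library); the scattered reads and writes in the inner loop are what become un-coalesced global memory accesses on a GPU:

import numpy as np
from collections import defaultdict
from itertools import product

def generate_rulebook(active, kernel=3):
    # active: dict mapping voxel coordinate -> row index of its feature.
    # Returns rulebook[offset_id] = list of (input_row, output_row) pairs.
    offsets = list(product(range(-(kernel // 2), kernel // 2 + 1), repeat=3))
    rulebook = defaultdict(list)
    for out_coord, out_row in active.items():          # A_out == A_in here
        for k, off in enumerate(offsets):
            in_coord = tuple(c + o for c, o in zip(out_coord, off))
            in_row = active.get(in_coord)
            if in_row is not None:
                rulebook[k].append((in_row, out_row))
    return rulebook

def sparse_convolution(features, weights, rulebook):
    # features: (N, C_in); weights: (K^D, C_in, C_out). Returns (N, C_out).
    out = np.zeros((features.shape[0], weights.shape[2]))
    for k, pairs in rulebook.items():
        for in_row, out_row in pairs:                   # scattered reads and writes
            out[out_row] += features[in_row] @ weights[k]
    return out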

3.1.2 Adaptive chunk-based sparse convolution

To remove the obstacle of random GPU global memory access, we adopt a chunk-based method. It is based on the observation that each input location is accessed multiple times when generating the different output features in its neighborhood. Therefore, by splitting the input space into independent chunks, voxels inside each chunk have a high probability of using the same input features for convolution; these features need to be read only once and are cached in local memory for later use, reducing random global memory access by a large margin.

First, we split the input locations into chunks of size C^D, i.e., cubes of size C × C × C in the case of 3D sparse convolution. We then perform sparse convolution independently for each chunk in parallel. As shown in Alg. 2, the group of threads performing convolution inside a chunk shares the same piece of input features and weight data, so we are able to cache them in shared memory.

Algorithm 2 Chunk-based Sparse Convolution

Require: chunk input features Fi, chunk output feature placeholder Fo, kernel weights W, rulebook list R; K^D is the kernel volume
1:  procedure CONVOLUTION FOR A SINGLE CHUNK
2:    generate the rulebook R^chunk
3:    Fo ← 0
4:    copy W and Fi into shared memory
5:    for all i ∈ {0, 1, ..., K^D − 1} in parallel do
6:      for all (I_j, O_j) ∈ R^chunk_i in parallel do
7:        F_{O_j} ← F_{O_j} + W_i · F_{I_j}

We now explain our implementation of the parallel structure in detail. Fig. 2 illustrates the sparse convolution forward operation for a single chunk, where (N_in, N_out) denote the numbers of non-empty features in the input and output regions, (C_i, C_o) denote the numbers of input/output channels, K^D is the kernel volume, and B is a subdivision parameter. Note that the "∗" operator here denotes a sparse convolution guided by the rulebook R. N_in, N_out, C_i, and C_o are subdivided by B, since they can be large in practice. For a single chunk, we run C_o/B CUDA blocks in parallel, each assigned to generate B output channels. Within each block, we launch threads with dimensions (N_out/B, B, B), each assigned to compute the partial convolution of B input channels and add it to the output; the final result is obtained by looping over all input channels. Within a CUDA block, the shared data are the (N_in × B) part of the input features and the (B × B × K^D) part of the weights, as marked in Fig. 2.
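The following NumPy sketch conveys the data-reuse idea behind Alg. 2 and Fig. 2: the features needed by one chunk are gathered from "global" storage once into a local cache, which stands in for GPU shared memory, and reused for all outputs in the chunk. The function name, the 3 × 3 × 3 kernel, and the dictionary-based cache are illustrative assumptions; the real CUDA kernel additionally tiles the channels by B as described above.

import numpy as np

def conv_one_chunk(global_features, active, chunk_min, chunk_size, weights):
    # global_features: (N, C_in) feature matrix; active: dict coord -> row index;
    # chunk_min: corner of the chunk; weights: dict kernel offset -> (C_in, C_out).
    c_out = next(iter(weights.values())).shape[1]
    margin = 1                                       # (K - 1) / 2 for a 3 x 3 x 3 kernel
    lo = np.asarray(chunk_min) - margin
    hi = np.asarray(chunk_min) + chunk_size + margin

    cache = {}                                       # gather the chunk's input region once
    for coord, row in active.items():
        if np.all(np.asarray(coord) >= lo) and np.all(np.asarray(coord) < hi):
            cache[coord] = global_features[row]      # one global read per cached voxel

    out = {}
    for u in cache:                                  # outputs restricted to the chunk proper
        if not all(chunk_min[d] <= u[d] < chunk_min[d] + chunk_size for d in range(3)):
            continue
        acc = np.zeros(c_out)
        for off, w in weights.items():
            v = (u[0] + off[0], u[1] + off[1], u[2] + off[2])
            if v in cache:                           # every reuse hits the local cache
                acc += cache[v] @ w
        out[u] = acc
    return out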

Since the distribution of sparse data is often uneven, direct chunk splitting may generate chunks containing drastically different numbers of points. If the number of points inside a chunk is too large, it may exceed the capacity of shared memory; if it is too small, the GPU threads may not be fully utilized. Therefore, we adopt an adaptive chunk splitting method that divides the spatial area into chunks in a multi-level manner. Fig. 3 shows an example of 2D adaptive chunk splitting with three levels. At the first level, we split all the output locations into squares of size C × C. We then evaluate the number of non-empty points in the InputRegion, the set of possible input locations involved in generating the outputs within a chunk. Fig. 4 shows a simple example with kernel size K = 3 and chunk size C = 4; the InputRegion is essentially the chunk region plus a (K − 1)/2 margin, and the input regions of neighboring chunks overlap. The number of non-empty points in the InputRegion indicates the memory needed to cache the input features in shared memory. At levels 2 and 3, we further split the chunks that are too large to process in a single GPU thread block into chunks of size C/2 and C/4.

Eventually, we divide all the data into a list of chunks of different sizes. Since 3D points are unevenly distributed, it is important to ensure that all chunks contain roughly the same number of data points: we take larger chunks where points are sparse and finer chunks where points are dense. After the adaptive splitting, the chunks are balanced in the sense that the number of data points in each chunk does not exceed the maximum capacity of a single GPU thread block, while over-splitting, which would cause inefficiency in parallel processing, is avoided.

Fig. 2: Parallel structure for chunk-wise sparse convolution. Chunks are processed independently in parallel. For each chunk, a single GPU thread block is assigned to generate B output feature channels (green box). In each iteration, the block processes B input feature channels (orange box), and the marked parts of the input features and weights are cached in shared memory.

Fig. 3: Adaptive chunk splitting (2D example; panels (a)-(c) show levels 1-3). Each dot represents a non-empty output location. The output space is first split into squares of size C; chunks that contain too many input data points are passed to the next level of splitting, and each level halves the chunk size.
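A recursive sketch of the multi-level splitting is given below; the capacity threshold, the minimum chunk size, and the octree-style 8-way subdivision are illustrative assumptions, and the actual implementation performs each level of splitting in parallel on the GPU as noted next:

import numpy as np

def adaptive_split(active_coords, origin, size, max_points, min_size):
    # Recursively split a cubic region until each chunk's input region holds at
    # most max_points active voxels (or the minimum chunk size is reached).
    # active_coords: (N, 3) int array; origin: chunk corner; size: edge length.
    margin = 1                                    # (K - 1) / 2 for a 3 x 3 x 3 kernel
    lo = np.asarray(origin) - margin
    hi = np.asarray(origin) + size + margin
    inside = np.all((active_coords >= lo) & (active_coords < hi), axis=1)
    count = int(inside.sum())
    if count == 0:
        return []                                 # empty chunk, nothing to process
    if count <= max_points or size <= min_size:
        return [(tuple(origin), size)]            # small enough for one thread block
    half = size // 2
    chunks = []
    for dz in (0, half):                          # split the oversized chunk 8 ways
        for dy in (0, half):
            for dx in (0, half):
                sub = (origin[0] + dx, origin[1] + dy, origin[2] + dz)
                chunks += adaptive_split(active_coords, sub, half,
                                         max_points, min_size)
    return chunks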

Note that, to improve performance, we perform each level of splitting and rulebook generation in parallel using CUDA. Because all the sparse locations are stored in the form of a hash table in $\mathcal{A}$, it is natural to use a GPU-based parallel hash table implementation [1] to achieve high performance. We use the following hash function for 3D coordinates:

$$\text{key} = (((R * P) \;\mathrm{OR}\; x) * P \;\mathrm{OR}\; y) * P \;\mathrm{OR}\; z, \qquad (3)$$

where $(x, y, z)$ are the input 3D coordinates, $P = 16777619$ is a prime number, and $R$ is a random integer used to generate multiple hash functions.
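Read literally, Eq. (3) interleaves multiplications by P with a bitwise OR against each coordinate; the sketch below renders it in Python, where the 32-bit masking and the example value of R are our assumptions rather than details taken from the GPU hash table implementation [1]:

P = 16777619            # prime from Eq. (3)
MASK = 0xFFFFFFFF       # assumed 32-bit key width

def spatial_hash(x, y, z, r):
    key = (r * P) & MASK           # R * P
    key = ((key | x) * P) & MASK   # ( ... OR x) * P
    key = ((key | y) * P) & MASK   # ( ... OR y) * P
    key = (key | z) & MASK         # ... OR z
    return key

print(spatial_hash(12, 34, 56, r=2654435761))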

3.2 Spatial Attention for Cross-scale Fusion module

Benefiting from the adaptive chunk-based sparse representation, the efficiency of processing 3D data is dramatically improved; furthermore, it gives more flexibility and guidance to framework design informed by previous 2D network architecture exploration. However, such copy-to-3D innovations may not transfer well, because of the gap in characteristics between 2D and 3D data. For example, dilated convolution [47] is widely used in 2D segmentation architectures [2, 9] but is inherently unsuitable for the 3D task, because the sparsity of activations leads to zero values almost everywhere in the output tensor, and the alternative of neighbor search brings a heavy computational burden. Another example is that many edge-enhancement techniques [6, 25]

Page 4: Live Semantic 3D Perception for Immersive Augmented Reality...3D point cloud data, makes high resolution 3D convolutional neural networks tractable with state-of-the-art results on

HAN ET AL.: LIVE SEMANTIC 3D PERCEPTION FOR IMMERSIVE AUGMENTED REALITY 2015

lies in the empty voxels without points. To address this problem , [37]proposes to use octree data structure where each node stores a pooledfeature representation with varied spatial size, e.g., coarse resolutionfor empty spaces (inactive voxels) and fine resolution for voxels thatcontain points (active voxels). [13] employs hashing techniques foractive voxels and omit inactive voxels in the convolution step and fur-ther proposes Submanifold Sparse Convolution (SSC) [14] to avoid the”dilate” effect of convolution operations by restricting inactive voxelsto be inactive during the forward step of the network. SSC achieveshigh performance with bounded theoretical FLOPs operation, yet itsrun-time is limited by irregular sparse data representation. [36] pro-poses tiled sparse convolution module and process each blocks of dataas dense convolution which could be processed efficiently by conven-tional dense convolution operations. However, it only supports 2Dconvolution where pixels are densely organized in each small blockdue to the fact that dense convolution operates on the inactive voxelsas well. Recently, Minkowski Net [8] is proposed as a framework forgeneralized sparse convolution employing the sparse tensors of Pytorchlibrary for efficient convolution operations. In this paper, an adap-tive chunk-based convolutional strategy is proposed to further improvethe efficiency of sparse 3D convolution. The efficiency gain comesfrom the observation that despite sparse, point clouds are continuouslydistributed on object surfaces in 3D space, where reusing features ofnear-by voxels is possible based on the proposed chunk data structurefor spatially sparse convolution operations.

3 METHODS

A real-time geometric and semantic 3D perception system is presentedin this paper as illustrated in Fig. 1. The reconstructed 3D model andsemantic labels are sent to a physics engine (Unity3D is used in ourexperiments) that could be further used for immersive human-involvedinteractions with the environment as shown in Sec. 4.4. Particularly,we demonstrate the effectiveness of the proposed system by throwing avirtual ball to the real world scene. Given the geometric 3D model, theball’s trajectory is controlled by its initial speed, gravity and reboundswhen it comes up against object surfaces. Given the semantic 3D model,we can determine the damping factor of its collision process basedon different objects, e.g., small damping factor for sofa while largedamping factor for floor as they are composed by different materialsdetermined by semantic priors.

FlashFusion [15] is employed for geometric modeling of the envi-ronment, which takes RGBD sequences as input, tracks camera posesat a front-end thread for each frame and generates a globally consis-tent dense 3D model at key frame rate in the back-end thread. 3Dsurfaces are represented implicitly as the zero level set of the signeddistance function and extracted as triangle meshes using MarchingCube [24] algorithm. A signed distance function is defined on the volu-metric space at a fixed resolution (E.g., 5mm in our experiments). Forlarge scale dense 3D reconstruction, a spatial hashing [31] technique isadopted where continuous N ×N ×N voxels are organized as a chunkand each chunk is mapped to an address based on the spatial hashfunction [41]. Only chunks that are close to object surfaces are storedwhich significantly reduces the memory requirements compared withdense volumetric representations. FlashFusion employs FastGO [16]for globally consistent camera pose estimation and proposes sparsevoxel sampling technique for efficient signed distance function fusion,enabling efficient dense 3D reconstruction using CPU computing onportable devices.

For semantic understanding of the environment, we adopt the vol-umetric 3D convolution for point cloud segmentation. The spatially-sparse convolutional network that firstly proposed by [13, 14] is em-ployed as the basic framework for the convolutional operations. Al-though the volumetric-based methods have achieved impressive per-formance on 3D semantic understanding problems, the complexitylimits their usages for real-time applications on portable devices. InSec. 3.1, an adaptive chunk-based sparse convolution module is intro-duced which enables our real-time semantic understanding on portabledevices. The network structure used for 3D semantic segmentation inthis work is illustrated in Sec. 3.2, which achieves superior performance

on public indoor datasets as shown in the experiments in Sec. 4.

3.1 Adaptive chunk-based sparse convolutionIn order to clearly propose our method, firstly we will give a briefbackground on how spatially sparse convolution works. Afterwards,we explain our method in general, followed by some implementationdetails.

3.1.1 Spatially sparse convolutional networks

Compared to the traditional 2D dense convolution widely used in thecontext of image recognition and segmentation, direct 3D dense con-volution can be very costly and inefficient due to the large number ofweights and the high-dimensional input space. Therefore, we have totake advantage of the intrinsic sparsity of the 3D point cloud data andmake the cost of computation tractable. Among several approachesoptimized for 3D convolution, we adopt the spatially sparse convolu-tion. As described in a few sparse convolution frameworks [8] [14], thesparse convolution basically discards empty locations, only preservingand executing convolution on those non-empty locations of the inputdata, thus saving a lot of memory and computation.

To put the sparse convolution formally, we let xinu ∈RCin be the input

feature located at u ∈RD(D-dimensional coordinate) with Cin channels.We define the kernel weights as W ∈ RKD×Cin×Cout . We slice W by thefirst dimension, resulting a list of matrices of size Cin ×Cout. The i-thslice of W is denoted as Wi, i ∈ {1,2, · · · ,KD}. A single convolutionforward pass is formulated as down below:

xoutu = ∑

i∈ND(K)∩{i|u+i∈Ain}Wixin

u+i for u ∈ Aout, (2)

where ND(K) is the convolution reception field defined by the kernelshape and is centered at the origin, e.g, N1(3) = {−1,0,1}. A denotesthe set for non-empty locations. Consider the simplest case of a paddedconvolution with the kernel stride being 1, where the spatial size re-mains unchanged, Aout is the same as Ain. Note that only non-emptylocations are involved in computation.

In the traditional way, we conduct sparse convolution as Alg. 1.First, we generate a rulebook R based on Ain,Aout and kernel shape.Each entry (I j,O j) in Ri indicates the feature located at I j should bemultiplied by the i-th weight and then added to the feature located atO j.

Algorithm 1 Spatially Sparse Convolution

Require: Input features Fi, output feature placeholder Fo, kernelweights W, rulebook R, KD represents the kernel volume,

1: procedure GENERATE RULEBOOK2: for all Oi ∈ Aout do3: for all Ii ∈ Ain ∩ReceptionField of Oi do4: offset ← The position of Ii in reception field5: Append (Ii,Oi) to Roffset

6: procedure SPARSE CONVOLUTION7: Fo ← 08: for all i ∈ {0,1, · · · ,KD −1} do9: for all (I j,O j) ∈ Ri do

10: FO j ← FO j +Wi ∗FIj // Optional to add bias

As explained above, the rulebook R indicates the memory addressesfor input/output features required in a convolution operation. Rulebookis needed because sparse data are stored continuously and unorderlyin memory. We use rulebook to clarify the relationship between thespatial locations and the memory address.

Intuitive as it seems, there exists a major pitfall in performance whenit comes to GPU parallelization. Since all input data are unorderlystored in GPU global memory, this algorithm requires a large amountof random un-coalesced global memory access, which would stronglylimit GPU memory bandwidth.

3.1.2 Adaptive chunk-based sparse convolutionIn order to remove the obstacle of random GPU global memory access,we adopt a chunk-based method. It is based on the observation thateach input locations are accessed multiple times when generating dif-ferent output features in its neighbor. Therefore, by splitting the inputspace into independent chunks, voxels inside each chunk have a highprobability of using the same input features for convolution, which areonly required to be accessed only once and are cached into the localmemory for later usage, reducing the random global memory access bya huge margin.

First, we split the input locations into chunks of size CD, in the caseof 3D sparse convolution, cubes of size C ∗C ∗C. We then performsparse convolution independently for each chunk in parallel. As shownin Alg. 2, a group of threads performing convolution inside a chunkends up sharing the same piece of input features and weights data.Therefore, we are able to cache them in shared memory.

Algorithm 2 Chunk-based sparse convolution

Require: Chunk input features Fi, Chunk output feature placeholderFo, kernel weights W, rulebook list R, KD represents the kernelvolume,

1: procedure CONVOLUTION FOR A SINGLE CHUNK2: Generate Rulebook Rchunk

3: Fo ← 04: Copy W,Fi into shared memory5: for all i ∈ {0,1, · · · ,KD −1} in parallel do6: for all (I j,O j) ∈ Rchunk

i in parallel do7: FO j ← FO j +Wi ∗FIj

We further explain our implementation with regard to the parallelstructure in detail. Fig. 2 illustrates the sparse convolution forwardoperation for a single chunk, where (Nin,Nout) represents number ofnon-empty features in the input region and the output region. (Ci,Co)represents the number of the input/output channels, KD represents thekernel volume, B is a parameter for subdividing. Note that the ”∗”operator here means a sparse convolution guided by the rulebook R.Here, Nin,Nout,Ci,Co is subdivided by B, since they can be large inpractice. For a single chunk, we run in parallel Co/B CUDA blocks,each assigned to generate B output channels. Within each block, welaunch threads as dimension (Nout/B,B,B), each assigned to computethe partial convolution of B input channels and add to the output. Thenwe get the final result by looping over all input channels. Within aCUDA block, the amount of shared data is the (Nin ∗B) part of theinput features and the (B ∗B ∗KD) part of weights, which is markedout in Fig. 2.

Since the distribution of sparse data is often uneven, direct chunk splitting may generate chunks containing drastically different numbers of points. If the number of points inside a chunk is too large, it may exceed the capacity of shared memory; if it is too small, the GPU threads may not be fully utilized. Therefore, we adopt an adaptive chunk splitting method that divides the spatial area into chunks in a multi-level manner. Fig. 3 shows an example of 2D adaptive chunk splitting with 3 levels. At the first level, we split all the output locations into squares of size C×C. We then evaluate the number of non-empty points in the InputRegion, which is the set of possible input locations involved in generating the outputs within a chunk. Fig. 4 shows a simple example with kernel size K = 3 and chunk size C = 4: the InputRegion is essentially the chunk region plus a (K−1)/2 margin, and the input regions of neighboring chunks overlap. The number of non-empty points in the InputRegion indicates the memory needed to cache the input features in shared memory. At levels 2 and 3, chunks that are too large to be processed by a single GPU thread block are further split into chunks of size C/2 and C/4.

Eventually, we divide all the data into a list of chunks of different sizes. Since 3D points are unevenly distributed, it is important to ensure that all chunks contain roughly the same number of data points: we take larger chunks where points are sparse and finer chunks where points are dense.

Fig. 2: Parallel structure for chunk-wise sparse convolution: chunks are processed independently in parallel. For each chunk, a single GPU thread block is assigned to generate B output feature channels (green box). In each iteration, the block processes B input feature channels (orange box), and the marked parts of the input features and weights are cached in shared memory.

(a) Level 1 (b) Level 2 (c) Level 3

Fig. 3: Adaptive chunk splitting: in the 2D example, each dot represents a non-empty output location. The output space is first split into squares of size C; chunks that contain too many input points are then passed to the next level of splitting, and each level halves the chunk size.

After the adaptive splitting, the chunks are balanced in the sense that the number of data points in each chunk does not exceed the maximum capacity of a single GPU thread block, while over-splitting, which would cause inefficiency during parallel processing, is avoided.
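A simplified 2D version of this adaptive splitting (our own sketch; the paper performs the 3D analogue in parallel on the GPU) can be written as a recursion that keeps a chunk when its input region holds few enough points and otherwise splits it into quarters:

# Illustrative 2D adaptive chunk splitting (the paper runs the 3D analogue in parallel on the GPU).
# points: set of occupied integer coordinates. A chunk is kept when its input region
# (the chunk extended by the (K-1)/2 kernel margin, cf. Fig. 4) holds at most `capacity` points.
def count_input_region(points, origin, size, margin):
    x0, y0 = origin
    return sum(1 for (x, y) in points
               if x0 - margin <= x < x0 + size + margin
               and y0 - margin <= y < y0 + size + margin)

def split_adaptive(points, origin, size, margin, capacity, min_size=1):
    n = count_input_region(points, origin, size, margin)
    if n == 0:
        return []                                        # empty chunk: nothing to process
    if n <= capacity or size <= min_size:
        return [(origin, size)]                          # fits into one GPU thread block
    half = size // 2                                     # otherwise split into four sub-chunks
    chunks = []
    for dx in (0, half):
        for dy in (0, half):
            chunks += split_adaptive(points, (origin[0] + dx, origin[1] + dy),
                                     half, margin, capacity, min_size)
    return chunks

# usage: split an 8x8 tile with kernel size K = 3 (margin 1) and a capacity of 16 points
pts = {(x, y) for x in range(8) for y in range(8) if (x * y) % 3 == 0}
print(split_adaptive(pts, (0, 0), 8, margin=1, capacity=16))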

Note that, to improve performance, we perform each level of splitting and rulebook generation in parallel using CUDA. Because all the sparse locations are stored in a hash table A, it is natural to use a GPU-based parallel hash table implementation [1] to achieve high performance. We use the following hash function for 3D coordinates:

key = (((R∗P) OR x)∗P OR y)∗P OR z, (3)

where (x, y, z) are the input 3D coordinates, P = 16777619 is a prime number, and R is a random integer used to generate multiple hash functions.
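Eq. 3 can be sketched as below (our own illustration; the table handling itself follows the GPU hashing of [1]). Note that Eq. 3 as printed combines terms with a bitwise OR; multiplicative hashes of this FNV style more commonly combine with XOR, so both variants are shown and the choice is left as an assumption:

# Hash keys for 3D voxel coordinates (illustrative; not the paper's implementation).
P = 16777619                   # prime from Eq. 3 (the FNV prime)

def hash_key_or(x, y, z, R, table_size=2 ** 20):
    key = (((R * P) | x) * P | y) * P | z      # literal reading of Eq. 3 (bitwise OR)
    return key % table_size

def hash_key_xor(x, y, z, R, table_size=2 ** 20):
    key = (((R * P) ^ x) * P ^ y) * P ^ z      # common FNV-style variant using XOR (our assumption)
    return key % table_size

# R is a random integer; different values of R yield different hash functions
print(hash_key_xor(12, 7, 3, R=2654435769))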

3.2 Spatial Attention for Cross-scale Fusion Module

Benefiting from the adaptive chunk-based sparse representation, the efficiency of processing 3D data is dramatically improved; moreover, it offers more flexibility and guidance for transferring design ideas from previous 2D network architectures. However, such "copy-to-3D" innovations do not always transfer well, because of the gap between the characteristics of 2D and 3D data. For example, dilated convolution [47] is widely used in 2D segmentation architectures [2, 9] but is inherently unsuitable for the 3D task, since the sparsity of activations leads to zero values almost everywhere in the output tensor, and the alternative based on neighbor search brings a heavy computational burden. Another example is that many edge-enhancement techniques [6, 25] that are effective for 2D segmentation turn out to be redundant in 3D.


Fig. 4: Evaluating the number of input points via the input region: for the example with kernel size K = 3 and chunk size C = 4, the input region is larger than the output region by a margin of (K−1)/2.

The reason is that 3D geometry provides pre-separated boundaries between instances, whereas the projective geometry of the 3D world, i.e., the 2D image, blends multiple instances along the depth dimension, so the boundary requires extra processing.

Considering the specificities mentioned above, we propose our Spatial Attention based Cross-scale Fusion Module. For ease of illustration, we first define the notation and framework in Sec. 3.2.1 and then present our fusion module in Sec. 3.2.2.

3.2.1 Notation and Framework

Recall that the 3D segmentation problem refers to the following: given a voxel grid Vin ∈ R^{H×W×D×C}, find the semantic counterpart Vlabel ∈ R^{H×W×D×1}, where H, W, D and C stand for the three spatial dimensions and the feature dimension, respectively. With the sparse representation, however, this conceptual 4D tensor is in fact stored and computed in the chunk-based form presented in Sec. 3.1. Although we can regard it as a dense tensor conceptually, the memory and compute consumption relate only to the number of occupied spatial points, denoted P. In this section, the voxel grid refers to the chunk-based feature shown in Fig. 2 unless otherwise specified.

Using our framework, the estimate of the semantic label is obtained by:

Vlabel = f (Vin), (4)

where f(·) denotes the proposed network. The proposed network enhances a pyramid encoder-decoder architecture with our Cross-scale Fusion module. We first briefly go through the architecture and then present more details about the module in the next subsection.

The encoder of the architecture stacks L layers (from 0 to L−1) of convolution modules that follow an extract-downscale pattern, as in common practice (see details in Tab. 1). Briefly, given the input Vin ∈ R^{H×W×D×C}, the encoder feature F_i^encoder ∈ R^{(H/2^i)×(W/2^i)×(D/2^i)×C_i} is defined as:

F_i^encoder = f_i^extract(Vin),                               i = 0
F_i^encoder = f_i^extract(f_i^downscale(F_{i−1}^encoder)),    1 ≤ i ≤ L−1        (5)

where f_i^extract(·) and f_i^downscale(·) denote the feature extraction and downscale modules defined in Tab. 1.
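Read as code, the recursion of Eq. 5 is simply the following PyTorch-style sketch (dense tensors and stand-in Conv3d modules for readability only; the real network operates on the chunk-based sparse representation, and extract/downscale stand for the modules of Tab. 1):

# Sketch of the encoder recursion in Eq. 5 (dense tensors for readability only).
import torch
import torch.nn as nn

def run_encoder(v_in, extract, downscale):
    """v_in: (1, C, H, W, D); extract[i], downscale[i]: modules for levels i = 0..L-1."""
    feats = [extract[0](v_in)]                                 # F_0 = f_extract_0(V_in)
    for i in range(1, len(extract)):
        feats.append(extract[i](downscale[i](feats[-1])))      # F_i = f_extract_i(f_downscale_i(F_{i-1}))
    return feats

# toy usage with stand-in dense modules (channel widths follow the (i+1)*M convention; M = 4, L = 3)
M, L = 4, 3
extract = [nn.Conv3d(M * (i + 1), M * (i + 1), 3, padding=1) for i in range(L)]
downscale = [nn.Identity()] + [nn.Conv3d(M * i, M * (i + 1), 2, stride=2) for i in range(1, L)]
# downscale[0] is an unused placeholder: level 0 takes V_in directly
feats = run_encoder(torch.randn(1, M, 16, 16, 16), extract, downscale)
print([tuple(f.shape) for f in feats])   # halved resolution and growing width at each level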

The decoder follows a corresponding hierarchical upscale-fusion scheme to generate the feature F_{i−1}^decoder at each scale, where the fusion module f_i^fusion takes F_j^decoder, i ≤ j ≤ L−1, as its fusion sources. Specifically, F_{i−1}^decoder is defined as follows:

F_{i−1}^decoder = f_i^upscale(F_i^encoder),                      i = L−1
F_{i−1}^decoder = f_i^upscale(F_i^encoder) + f_i^fusion(Φ),      1 ≤ i ≤ L−2        (6)

Fig. 5: Architecture of the Spatial Attention based Cross-scale Fusion Module. The upper sub-figure shows an overview of the fusion from multi-scale paths, while the lower sub-figure zooms in on a specific Spatial Attention Path and presents its details.

where Φ = {F_i^decoder, ..., F_{L−1}^decoder} and f_i^upscale(·) denotes the deconvolution module defined in Tab. 1. The channel dimension of F_0^decoder is shrunk by a linear transformation, and the resulting Vlabel can then be used for optimization. Given a training set {V_k^in, V_k^label}_{k=1}^N, the target is to minimize the loss L = ∑_{k=1}^N H(f(V_k^in), V_k^label), where H denotes the cross-entropy function.

3.2.2 Cross-scale Fusion module

Towards effective yet efficient cross-scale feature fusion, we propose a novel spatial attention based fusion scheme that first re-weights the feature grid from each source scale by exploiting implicit semantic information, and then fuses the enhanced feature grids by addition. As noted in a previous study [22], multi-scale pyramidal feature maps are capable of approximating the category label. In other words, F_i^decoder contains sufficient semantic information to serve as a so-called "category prior" that further adjusts the intensity of each source-scale feature.

The intuition behind this scheme is to give the pyramidal network the flexibility to use a different spatial attention distribution in each source-to-target path. Although a deep network is capable of learning this non-local transformation by stacking layers, our design helps to achieve a more compact and efficient architecture.

Specifically, as shown in Fig. 5, the Cross-scale Fusion module takes multi-scale features as input to obtain F_{i−1}^decoder, 1 ≤ i ≤ L−2. For a specific F_j^decoder, i ≤ j ≤ L−2, the attention weight is obtained by:

A_{i−1,j} = f_sigmoid(F_j^decoder ∗ a_{i−1,j}),        (7)

where ∗ denotes the matrix product along the channel dimension, a_{i−1,j} ∈ R^{C_j×1} is a trainable linear-transformation parameter, and A_{i−1,j} ∈ R^{(H/2^j)×(W/2^j)×(D/2^j)×1} denotes the attention weight from scale j to scale i−1, the so-called source-target path. We then apply the weight to the grid, upsample, and merge by addition as follows:

F_{i−1,j}^enhanced = f_trilinear((F_j^decoder ∗ b_{i−1,j}) ⊙ A_{i−1,j}),        (8)

f_i^fusion(Φ) = ∑_{j=i}^{L−2} F_{i−1,j}^enhanced,        (9)

where ⊙ denotes the element-wise product with broadcasting, b_{i−1,j} ∈ R^{C_j×C_j} is a trainable linear-transformation parameter, and f_trilinear(·) is the trilinear upsampling function with scale factor 2^{j−i+1} in the H, W and D dimensions.
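A PyTorch-style sketch of one Spatial Attention Path and the fusion sum is given below (dense tensors and our own module names; the per-path linear maps a_{i−1,j} and b_{i−1,j} are realized as 1×1×1 convolutions, and, unlike the printed C_j×C_j shape of b_{i−1,j}, we map every source to the target channel width so that the sum of Eq. 9 is well defined, which is our assumption):

# Sketch of Eqs. 7-9: one spatial-attention path from source scale j to target scale i-1,
# plus the fusion sum over all source scales (dense tensors for readability only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPath(nn.Module):
    def __init__(self, c_j, c_out, scale):
        super().__init__()
        self.a = nn.Conv3d(c_j, 1, kernel_size=1)      # a_{i-1,j}: C_j -> 1 (Eq. 7)
        self.b = nn.Conv3d(c_j, c_out, kernel_size=1)  # b_{i-1,j}: mapped to the target width (our assumption)
        self.scale = scale                             # 2^(j-i+1): from scale j up to scale i-1

    def forward(self, f_j):
        attn = torch.sigmoid(self.a(f_j))              # A_{i-1,j}: one attention weight per voxel
        enhanced = self.b(f_j) * attn                  # element-wise product, broadcast over channels
        return F.interpolate(enhanced, scale_factor=self.scale,
                             mode='trilinear', align_corners=False)   # trilinear upsample (Eq. 8)

def fuse(paths, sources):
    return sum(p(f) for p, f in zip(paths, sources))   # f_fusion_i(Phi) = sum_j F_enhanced_{i-1,j} (Eq. 9)

# toy usage: fuse two source scales (j = i and j = i+1) into target scale i-1
paths = [AttentionPath(8, 4, 2), AttentionPath(12, 4, 4)]
fused = fuse(paths, [torch.randn(1, 8, 8, 8, 8), torch.randn(1, 12, 4, 4, 4)])
print(tuple(fused.shape))   # (1, 4, 16, 16, 16)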

Table 1: Details of the architecture, where M is the channel width of the network; we take M = 32 for all experiments.

Module           Layer              K   S   Cin        Cout
f_i^extract      bn + sub-conv1     3   1   M*(i+1)    M*(i+1)
                 bn + sub-conv2     3   1   M*(i+1)    M*(i+1)
                 resnet addition
f_i^downscale    bn1 + conv1        2   2   M*(i+1)    M*(i+2)
f_i^upscale      bn + deconv1       2   2   M*(i+1)    M*i
                 channel linear1            M*i        M*i
                 addition
                 bn + sub-conv1     3   1   M*i        M*i
                 bn + sub-conv2     3   1   M*i        M*i
                 resnet addition
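As a concrete reading of the f_extract row of Tab. 1, a stand-in residual block could look as follows (dense Conv3d replaces the submanifold sparse convolution purely for illustration, and the ReLU activations are our assumption, since Tab. 1 does not list activations):

# Stand-in for the f_extract module of Tab. 1: (bn + sub-conv) x 2 with a resnet addition.
import torch.nn as nn

class ExtractBlock(nn.Module):
    def __init__(self, channels):                     # Cin = Cout = M*(i+1) at level i
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),   # bn + sub-conv1 (K=3, S=1)
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),   # bn + sub-conv2 (K=3, S=1)
        )

    def forward(self, x):
        return x + self.body(x)                       # resnet addition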

4 EXPERIMENTS

In this section, we first report the implementation details of the proposed method, followed by a qualitative and quantitative analysis of our method on real-world sequences, as well as an efficiency evaluation. An augmented reality application demonstrating the effectiveness of our method is presented in the last subsection.

The proposed approach is evaluated on both a public dataset and real-world experiments. For the public dataset evaluation, the Stanford Large-Scale 3D Indoor Space Dataset (S3DIS) [4] is employed, which contains 6 large-scale indoor areas from 3 different buildings covering over 6000 m² with 13 object classes. All experiments on S3DIS are conducted on a single NVIDIA 1080Ti GPU and an Intel(R) Xeon(R) E5-2650 CPU. For the real-world experiments, we use a Microsoft Surface Book 2 laptop with an NVIDIA GTX 1060 (Mobile) GPU and an Intel Core i7-8650U CPU as the computing device. An additional RGBD sensor (ASUS Xtion) mounted on the laptop captures a live RGBD stream at 30 fps with 640×480 resolution.

4.1 Qualitative Analysis

We compare our method with previous state-of-the-art methods for point cloud segmentation, including the point-based method PointNet [34] and the volumetric method TangentConv [39], on the validation set of the S3DIS dataset [4].

Table 2: Performance evaluation on the S3DIS dataset in terms of both accuracy and efficiency.

Method              mIoU (%)   Efficiency (s)   Parameters (M)
PointNet [34]       41.3       0.15             3.6
PointNet++ [35]     52.3       0.86             1.0
TangentConv [39]    52.8       0.59             0.33
MKN (s20) [8]       62.6       0.98             37.9
MKN (s32) [8]       65.4       2.15             37.9
CBSCN (s20)         65.5       0.18             9.6
CBSCN (s50)         68.3       0.54             15.3

As illustrated in Fig. 6, our method produces less noisy and smoother semantic labels, leading to better semantic segmentation performance. For example, the 'door' label in our results is cleaner than that of the other methods, as marked by the red circles. Additionally, as marked by the blue circles, the 'bookcase' label also shows better accuracy than the other methods.

For further qualitative analysis, several representative sequences reconstructed by the proposed 3D perception system in real-world environments are illustrated in Fig. 7, including challenging geometric and semantic content in different office rooms and rest rooms. Note that we use the same network, trained on the public dataset, for all these experiments, which demonstrates that the proposed system generalizes well to real-world environments and achieves high-quality geometric and semantic reconstruction results.

4.2 Quantitative Analysis

For a quantitative analysis in terms of both efficiency and accuracy, we compare our method with previous state-of-the-art methods, including the point-based methods PointNet [34] and PointNet++ [35] as well as the volumetric methods TangentConv [39], SCN [14] and MinkowskiNet [8] (denoted MKN), on the S3DIS dataset. Note that S3DIS contains 6 large-scale indoor areas from 3 different buildings covering over 6000 m² with 13 object classes. Following previous methods [8, 39], we use Area 5 for testing and the rest for training. Area 5 is further divided into 68 rooms, and the mean processing time over all these rooms is recorded for the efficiency comparison among different algorithms. For accuracy evaluation, we adopt the most popular metric, mean Intersection over Union (mIoU), which measures the ratio of true positives to the union of true positives, false negatives and false positives, averaged over the classes.
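For reference, the mIoU used here can be computed from a confusion matrix as in the short sketch below (a standard formulation, not taken from the paper's evaluation code):

# Per-class IoU and mIoU from predicted and ground-truth label arrays (standard formulation).
import numpy as np

def mean_iou(pred, gt, num_classes=13):
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)     # conf[i, j]: voxels of class i predicted as j
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - tp             # TP + FP + FN per class
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return iou, np.nanmean(iou)                        # mIoU averages over the classes present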

For simplicity, the proposed chunk-based sparse-convolution network with the cross-scale feature fusion layer is denoted CBSCN. Furthermore, let s∗ denote the voxel resolution of volumetric convolution methods, e.g., s20 for voxels with a size of 1/20 m. The hyper-parameter L in Eqn. 5 is set to 5 and 7 for CBSCN (s20) and CBSCN (s50), respectively. Note that the proposed network is trained using the standard cross-entropy loss for segmentation and optimized by the Adam optimizer [19] with a learning rate of 1e-3.
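This optimization setup translates into a routine training step such as the hedged sketch below, where model stands for the CBSCN network and dense voxel tensors are used for readability:

# Training-step sketch for the described setup: cross-entropy loss + Adam with lr = 1e-3.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, v_in, v_label):
    """v_in: (1, C, H, W, D) voxel features; v_label: (1, H, W, D) integer class labels."""
    optimizer.zero_grad()
    logits = model(v_in)                               # (1, num_classes, H, W, D)
    loss = F.cross_entropy(logits, v_label)            # voxel-wise cross entropy
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)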

As shown in Tab. 2, we have the following observations:

• Point-based methods (PointNet) achieve the best efficiency, yet their mIoU is almost 20% lower than that of state-of-the-art approaches, yielding poor segmentation results.

• Our proposed CBSCN (s20) achieves efficiency comparable to PointNet while being 24.1% higher in accuracy. Among the other methods, MinkowskiNet [8] achieves similar accuracy while being 11× slower with 4× more parameters.

• Our proposed CBSCN (s50) achieves the best accuracy at a run-time comparable to the previous state of the art, e.g., 5× faster than MinkowskiNet with a 2.9% higher mIoU score.

The IoU score of each class is also presented in Tab. 3 for a detailed comparison between our approach and previous methods.


Fig. 6: Qualitative comparison on S3DIS [4]. We randomly pick several scenes of the test set, i.e., Area 5 of S3DIS, and show, from left to right: the input RGB point cloud, the ground truth, and the semantic predictions of our method, PointNet [34] and TangentConv [39]. Our network predicts more precise semantic labels, with an mIoU of 68.3%, compared with 41.3% for PointNet and 52.8% for TangentConv. All labels are visualized according to the NYU-40 coloring.

Fig. 7: Qualitative results of real-world experiments (rows: colored mesh geometry; mesh semantic labels). We evaluate the proposed 3D perception system in various real-world environments including different office rooms, rest rooms, etc. The proposed system predicts correct labels in real-world environments using the proposed CBSCN (s20) network trained on public datasets.

Tab. 3 further illustrates that the proposed CBSCN network achieves robust segmentation performance on challenging classes like sofa, window and column.

4.3 Efficiency Evaluations

Note that SCN [13, 14], MinkowskiNet and our proposed CBSCN are volumetric 3D convolution methods that share the idea of exploiting the sparsity of 3D surfaces to reduce computational complexity and memory consumption. SCN first proposed spatially-sparse convolutional networks that exploit the sparsity of high-dimensional data by directly employing dictionary-based multiplication to implement the sparse convolution operations. MinkowskiNet utilizes sparse tensors [42] for generalized sparse convolution. Our CBSCN further introduces a chunk-based SCN for an efficient implementation of sparse convolution that exploits neighboring non-empty voxels to reduce random global memory access on parallel computing devices. Thus, we compare these three sparse convolution schemes under the same network structure from [8] at scales 20 and 32. The experiments are evaluated using the average processing time per room on the validation set (Area 5) of the S3DIS dataset [4]. As shown in Tab. 4, the proposed CBSCN is 4× faster than MinkowskiNet and 5× faster than the original implementation of SCN [14].

For training, we employ the same chunk-based method in both the forward and backward passes, so the training stage also benefits from our chunk-based convolution; specifically, our backward pass is 2× faster than that of the original SCN [14].

Furthermore, we evaluate the efficiency of our method in terms of scalability, as shown in Fig. 8. The reconstructed 3D model becomes larger as more observations arrive, and 3D semantic segmentation is executed whenever the geometry model is updated. As shown in Fig. 8, the inference time of the proposed adaptive chunk-based sparse convolutional network ranges from 50 ms to 250 ms, while the original SCN [14] exceeds 2000 ms for large point clouds, which is not suitable for online semantic 3D perception.

Fig. 8: Inference time comparison between the original SCN and our proposed adaptive chunk-based SCN while reconstructing the environment. The reconstructed 3D model becomes larger as more observations arrive, and the inference times of both the original SCN and the proposed CBSCN grow linearly with the number of points in the 3D model.

4.4 Augmented Reality Experiments

The presented 3D perception system provides both a geometric model and a semantic understanding of the environment, which serve as fundamental components of various unmanned systems and augmented reality applications. In this section, we present an augmented reality application built on the presented perception system to demonstrate the effectiveness of our approach for real-time 3D perception. In addition, the system will be made publicly available as a fundamental platform for the community to develop immersive VR/AR applications.


Table 3: Per-class IoU on the whole S3DIS dataset. Our method achieves the best semantic segmentation performance on challenging classes like sofa, window and column.

Method        ceiling  floor  wall  beam  clmn  window  door  chair  table  bkcase  sofa  board  clutter  mIoU
PointNet      88.8     97.3   69.8  0.1   3.9   46.3    10.8  52.6   58.9   40.3    5.6   26.4   33.2     41.1
TangentConv   90.5     97.7   74.0  0.0   20.7  39.0    31.3  77.5   69.4   57.3    38.5  48.8   39.8     52.8
MKN (s20)     91.6     98.5   85.0  0.8   26.5  46.2    55.8  89.0   80.5   71.7    48.3  63.0   57.7     62.6
MKN (s32)     91.8     98.7   86.2  0.0   34.1  48.9    62.4  89.8   81.6   74.9    47.2  74.4   58.6     65.4
CBSCN (s20)   92.8     96.5   83.3  0.1   39.8  48.3    72.0  75.5   85.8   53.9    70.5  75.9   57.8     65.6
CBSCN (s50)   92.7     97.5   85.6  0.2   44.0  58.9    73.0  78.4   85.5   63.0    73.1  77.5   58.9     68.34

Table 4: Efficiency comparison under the same network structure. Our CBSCN outperforms the other sparse convolution schemes in terms of efficiency.

Method         scale = 20 (ms)   scale = 32 (ms)
MKN            980               2150
Original SCN   1267              2162.2
CBSCN          242.1             402.3

Table 5: Running time of each component of the proposed system in real-world experiments using only a portable device.

Step                       Run-time (ms)
Frame Tracking             36.1
Global Pose Optimization   14.5
Dense 3D Reconstruction    241.1
Semantic Segmentation      108.6

More results on real-world experiments are presented in the supplementary material. Note that the models used for semantic understanding are trained only on public datasets, without further fine-tuning.

As shown in Fig. 1, we use a mobile device (Surface Book 2) to scan the environment and interact with it through the RGB frames captured by an RGBD sensor (ASUS Xtion). All computations run locally on the mobile device thanks to the proposed efficient 3D perception system. More specifically, while the user holds the device and scans the environment, the geometric and semantic model of the environment is reconstructed in the background. In the user interface, either the rendered model or the RGB frame can be displayed on the screen, and the user can touch the screen to throw a synthetic ball with a fixed initial speed into the virtual environment, which is also rendered and displayed on the screen. Ideally, the ball's trajectory is a parabola determined by its initial speed and gravity. With the reconstructed geometric model of the environment, the ball bounces back when it hits object surfaces. With the additional semantic labeling of each 3D voxel, the bounce differs across objects, e.g., softer on the sofa and stronger on the desk, as they are made of different materials. We use the physics engine of Unity 3D for collision detection and the animation of the moving balls. The peak running time of each component is shown in Tab. 5 for a more detailed analysis of the efficiency of our system. Frame tracking runs in the front end at frame rate whenever a new frame is captured (approximately 20 Hz in our experiments). To reduce the tracking drift of visual observations, global pose optimization is applied whenever a new key frame is inserted (a key frame is selected every 10 frames). Based on the estimated camera pose, the geometric model is reconstructed incrementally and the whole 3D map is fed into the proposed chunk-based network for semantic understanding. The 3D model with semantic labels is updated at key-frame rate and sent to the physics engine for augmented reality applications.
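To illustrate how the semantic labels can drive the physics response (a simplified sketch standing in for the Unity 3D setup; the restitution values are made-up placeholders), each label can be mapped to a bounciness coefficient that scales the reflected velocity at a collision:

# Simplified, label-dependent bounce (the demo itself uses the Unity 3D physics engine;
# the restitution values below are hypothetical placeholders, not the demo's parameters).
import numpy as np

RESTITUTION = {'sofa': 0.25, 'table': 0.7, 'floor': 0.5, 'wall': 0.4}

def bounce(velocity, surface_normal, label):
    """Reflect the ball's velocity at a surface and damp it by the label's restitution."""
    n = surface_normal / np.linalg.norm(surface_normal)
    reflected = velocity - 2.0 * np.dot(velocity, n) * n     # mirror the velocity about the surface
    return RESTITUTION.get(label, 0.5) * reflected           # softer on the sofa, stronger on the table

print(bounce(np.array([1.0, -3.0, 0.0]), np.array([0.0, 1.0, 0.0]), 'sofa'))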

5 LIMITATIONS AND FUTURE WORK

Although the proposed 3D perception system achieves robust geometric modeling and semantic segmentation while requiring only a mobile laptop for computing, it still has limitations that could be addressed.

(a) Before update (b) After update

Fig. 9: Inconsistency of the incomplete model. The predictions for the desk and door (in the blue circle) become more accurate once the geometry in the red circle region is updated (from (a) to (b)).

As shown in Fig. 9, the semantic perception module predicts wrong labels for the desk and the door at an early stage of the reconstruction process, and produces accurate predictions once the 3D model is more complete. Our assumption is that the network is trained only on 3D models of complete scenes, which embed strong priors such as sofas being close to walls and chairs being connected to the floor. Attacking this problem with regularization techniques during training, so that the semantic segmentation remains robust even under partial observations of the environment, could further improve the proposed system and is left as future work.

6 CONCLUSION

We presented a real-time 3D geometric and semantic perception method for indoor scenes that uses only a single mobile device. Our adaptive chunk-based sparse convolution technique, together with a novel cross-scale feature fusion module, achieves an 11× speed-up for 3D semantic perception with accuracy comparable to previous state-of-the-art approaches. A live AR demo with a human in the loop further validates our real-time geometric and semantic 3D perception method. We have conducted extensive experiments to evaluate the effectiveness and robustness of our method in reconstructing the geometric and semantic information of 3D scenes. We believe this is a significant step towards convenient and robust geometric and semantic perception of the 3D world, which will enable many potential VR/AR applications.

ACKNOWLEDGMENTS

This work is supported in part by the Natural Science Foundation of China (NSFC) under contracts No. 61722209 and 6181001011, and in part by the Shenzhen Science and Technology Research and Development Funds (JCYJ20180507183706645).

REFERENCES

[1] D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta. Real-time parallel hashing on the gpu. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2009), 28(5), Dec. 2009.

[2] H. Alhaija, S. Mustikovela, L. Mescheder, A. Geiger, and C. Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV), 2018.

[3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2016.

[4] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.

[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[6] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4545–4554, 2016.

[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.

[8] C. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. arXiv preprint arXiv:1904.08755, 2019.

[9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

[11] A. Dai and M. Nießner. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–468, 2018.

[12] A. Dai, M. Nießner, M. Zollhofer, S. Izadi, and C. Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG), 36(3):24, 2017.

[13] B. Graham. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.

[14] B. Graham, M. Engelcke, and L. van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9224–9232, 2018.

[15] L. Han and L. Fang. Flashfusion: Real-time globally consistent dense 3d reconstruction using cpu computing. In Robotics: Science and Systems, 2018.

[16] L. Han, L. Xu, D. Bobkov, E. Steinbach, and L. Fang. Real-time global registration for globally consistent rgbd slam. IEEE Transactions on Robotics, 35(2):498–508, 2019.

[17] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[18] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3d shape segmentation with projective convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3779–3788, 2017.

[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.

[21] J. Li, B. M. Chen, and G. Hee Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9397–9406, 2018.

[22] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.

[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[24] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. SIGGRAPH Computer Graphics, 21(4):163–169, 1987.

[25] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing, 135:158–172, 2018.

[26] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.

[27] J. McCormac, A. Handa, A. Davison, and S. Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 4628–4635. IEEE, 2017.

[28] Y. Nakajima, K. Tateno, F. Tombari, and H. Saito. Fast and accurate semantic mapping through geometric-based incremental segmentation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 385–392. IEEE, 2018.

[29] G. Narita, T. Seno, T. Ishikawa, and Y. Kaji. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. arXiv preprint arXiv:1903.01177, 2019.

[30] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, 2011.

[31] M. Nießner, M. Zollhofer, S. Izadi, and Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6):169, 2013.

[32] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.

[33] Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung. Real-time progressive 3d semantic segmentation for indoor scenes. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1089–1098. IEEE, 2019.

[34] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[35] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

[36] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun. Sbnet: Sparse blocks network for fast inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8711–8720, 2018.

[37] G. Riegler, A. Osman Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.

[38] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[39] M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.

[40] K. Tateno, F. Tombari, and N. Navab. Real-time and scalable incremental segmentation on dense slam. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4465–4472. IEEE, 2015.

[41] M. Teschner, B. Heidelberger, M. Muller, D. Pomerantes, and M. H. Gross. Optimized spatial hashing for collision detection of deformable objects. In VMV, volume 3, pages 47–54, 2003.

[42] P. A. Tew. An investigation of sparse tensor formats for tensor libraries. PhD thesis, Massachusetts Institute of Technology, 2016.

[43] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. arXiv preprint arXiv:1904.08889, 2019.

[44] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison. Elasticfusion: Dense slam without a pose graph. In Robotics: Science and Systems, 2015.

[45] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[46] L. Xu, Z. Su, L. Han, T. Yu, Y. Liu, and L. Fang. Unstructuredfusion: Realtime 4d geometry and texture reconstruction using commercial rgbd cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence,

Page 10: Live Semantic 3D Perception for Immersive Augmented Reality...3D point cloud data, makes high resolution 3D convolutional neural networks tractable with state-of-the-art results on

HAN ET AL.: LIVE SEMANTIC 3D PERCEPTION FOR IMMERSIVE AUGMENTED REALITY 2021

Table 3: Per class IoU on the whole S3DIS Dataset. Our methods achieves the best semantic segmentation performance on the challenging classeslike sofa, window and column.

Method ceiling floor wall beam clmn window door chair table bkcase sofa board clutter mIoUPointNet 88.8 97.3 69.8 0.1 3.9 46.3 10.8 52.6 58.9 40.3 5.6 26.4 33.2 41.1

TangentConv 90.5 97.7 74.0 0.0 20.7 39.0 31.3 77.5 69.4 57.3 38.5 48.8 39.8 52.8MKN (s20) 91.6 98.5 85.0 0.8 26.5 46.2 55.8 89.0 80.5 71.7 48.3 63.0 57.7 62.6MKN (s32) 91.8 98.7 86.2 0.0 34.1 48.9 62.4 89.8 81.6 74.9 47.2 74.4 58.6 65.4

CBSCN (s20) 92.8 96.5 83.3 0.1 39.8 48.3 72.0 75.5 85.8 53.9 70.5 75.9 57.8 65.6CBSCN (s50) 92.7 97.5 85.6 0.2 44 58.9 73.0 78.4 85.5 63.0 73.1 77.5 58.9 68.34

Table 4: Efficiency comparisons under the same network structure. OurCBSCN outperforms the other sparse convolution schemes in terms ofefficiency.

scale = 20 (ms) scale = 32 (ms)MKN 980 2150

Original SCN 1267 2162.2CBSCN 242.1 402.3

Table 5: Running time of each component of the proposed system inreal world experiments using only a portable device.

Step Run-time (ms)Frame Tracking 36.1

Global Pose Optimization 14.5Dense 3D Reconstruction 241.1Semantic Segmentation 108.6

platform for the community to develop immersive VR/AR applications.More results on real-world experiments will also be presented in thepresented supplementary material. Note that the models used for se-mantic understanding are only trained on public datasets without furtherfine-tuning.

As shown in Fig. 1, we use a mobile device (Surface Book 2) to scanthe environment and interact with it through the RGB frames capturedby a RGBD sensor (ASUS xtion). All computations are implementedlocally on the mobile device thanks to the proposed efficient 3D per-ception system. More specifically, while the user is holding the deviceand scanning the environment, the geometric and semantic model ofthe environment is reconstructed in the background. And in the UserInterface, either the rendered model or RGB frame can be displayedon the screen and the user can touch the screen to throw a syntheticball with a fixed initial speed into the virtual environment which isalso rendered and displayed on the screen. Ideally, the ball’s trajectoryis a parabola controlled by both its initial speed and the gravity. Byutilizing the reconstructed geometric model of the environment, wemay observe that the ball bounces back when it come against objectsurfaces. By further utilizing the semantic labeling of each 3D voxel,we could further observe that the bounce is different for various objects,e.g., softer on the sofa and stronger on the desk as they are composed ofdifferent materials. We use the physics engine in Unity 3D for collisiondetection and the animation of the moving balls. The peak running timeof each component is shown in Tab. 5 for more detailed analysis on theefficiency of our presented system. Frame tracking is activated in frontend at frame rate whenever a new frame is captured (approximately20Hz in our experiments). To reduce the tracking drift of visual obser-vations, global pose optimization are applied when a new key frame isinserted (a key frame is selected every 10 frame). Based on the esti-mated camera pose, geometric model are reconstructed incrementallyand the whole 3D map are fed into the proposed chunk-based networkstructure for semantic understanding. 3D model with semantic labelsare updated at key frame rate which is sent to the physics engine foraugmented reality applications.

5 LIMITATIONS AND FUTURE WORK

The proposed 3D perception system achieves robust geometric model-ing and semantic segmentation while merely requires a mobile laptop

(a) Before update (b) After update

Fig. 9: Inconsistency of incomplete model. The predictions of deskand door(in blue circle) become more accurate when the geometry isupdated in red circle region(from (a) to (b)).

for computing, it still has limitations that could be further improved.As shown in Fig. 9, the semantic perception module predicts wronglabels for the desk and door at the early stage of the reconstructionprocess, and produces accurate predictions when the 3D model is morecomplete. Our assumption is that the network is only trained with 3Dmodels that are complete scene of the environment, with strong priorslike sofa should be close to the wall and chairs are always connectedto the floor. Attack this problem using regularization technique whentraining the network to get robust semantic segmentation results evenwith partial observations of the environment might be helpful to furtherimprove the proposed system and will be left as future work.

6 CONCLUSION

We presented a real-time 3D geometric and semantic perception methodfor indoor scene, only using a single mobile device. Our adaptivechunk-based sparse convolution technique as well as a novel cross-scale feature fusion module achieved 11× speed up for 3D semanticperception with comparable accuracy performance of previous state-of-the-art approaches. A human-involved live AR demo is further providedto validate our real-time geometric and semantic 3D perception method.We have conducted extensive experiments to evaluate the effectivenessand robustness of our method to reconstruct the geometric and semanticinformation of the 3D scene. We believe that it is a significant steptowards convenient and robust geometric and semantic perception ofthe 3D world, which will enable many potential VR/AR applications.

ACKNOWLEDGMENTS

This work is supported in part by Natural Science Foundation ofChina (NSFC) under contract No. 61722209 and 6181001011, inpart by Shenzhen Science and Technology Research and Develop-ment Funds (JCYJ20180507183706645).

REFERENCES

[1] D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher,J. D. Owens, and N. Amenta. Real-time parallel hashing on the gpu. ACMTransactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2009),28(5), Dec. 2009.

[2] H. Alhaija, S. Mustikovela, L. Mescheder, A. Geiger, and C. Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV), 2018.

[3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2016.

[4] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.

[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[6] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4545–4554, 2016.

[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.

[8] C. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. arXiv preprint arXiv:1904.08755, 2019.

[9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

[11] A. Dai and M. Nießner. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–468, 2018.

[12] A. Dai, M. Nießner, M. Zollhofer, S. Izadi, and C. Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG), 36(3):24, 2017.

[13] B. Graham. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.

[14] B. Graham, M. Engelcke, and L. van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9224–9232, 2018.

[15] L. Han and L. Fang. Flashfusion: Real-time globally consistent dense 3d reconstruction using cpu computing. In Robotics: Science and Systems, 2018.

[16] L. Han, L. Xu, D. Bobkov, E. Steinbach, and L. Fang. Real-time global registration for globally consistent rgbd slam. IEEE Transactions on Robotics, 35(2):498–508, 2019.

[17] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[18] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3d shape segmentation with projective convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3779–3788, 2017.

[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.

[21] J. Li, B. M. Chen, and G. Hee Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9397–9406, 2018.

[22] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.

[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[24] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. SIGGRAPH Computer Graphics, 21(4):163–169, 1987.

[25] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing, 135:158–172, 2018.

[26] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.

[27] J. McCormac, A. Handa, A. Davison, and S. Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 4628–4635. IEEE, 2017.

[28] Y. Nakajima, K. Tateno, F. Tombari, and H. Saito. Fast and accurate semantic mapping through geometric-based incremental segmentation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 385–392. IEEE, 2018.

[29] G. Narita, T. Seno, T. Ishikawa, and Y. Kaji. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. arXiv preprint arXiv:1903.01177, 2019.

[30] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, 2011.

[31] M. Nießner, M. Zollhofer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6):169, 2013.

[32] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.

[33] Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung. Real-time progressive 3d semantic segmentation for indoor scenes. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1089–1098. IEEE, 2019.

[34] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[35] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

[36] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun. Sbnet: Sparse blocks network for fast inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8711–8720, 2018.

[37] G. Riegler, A. Osman Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.

[38] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[39] M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.

[40] K. Tateno, F. Tombari, and N. Navab. Real-time and scalable incremental segmentation on dense slam. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4465–4472. IEEE, 2015.

[41] M. Teschner, B. Heidelberger, M. Muller, D. Pomerantes, and M. H. Gross. Optimized spatial hashing for collision detection of deformable objects. In VMV, volume 3, pages 47–54, 2003.

[42] P. A. Tew. An investigation of sparse tensor formats for tensor libraries. PhD thesis, Massachusetts Institute of Technology, 2016.

[43] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. arXiv preprint arXiv:1904.08889, 2019.

[44] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison. Elasticfusion: Dense slam without a pose graph. In Robotics: Science and Systems, 2015.

[45] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[46] L. Xu, Z. Su, L. Han, T. Yu, Y. Liu, and L. Fang. Unstructuredfusion: Realtime 4d geometry and texture reconstruction using commercial rgbd cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019.

[47] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.

[48] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.