
Point Cloud Learning with Transformer

Xian-Feng Han, College of Computer and Information Science, Southwest University, [email protected]

Yu-Jia Kuang, College of Computer and Information Science, Southwest University

Guo-Qiang Xiao, College of Computer and Information Science, Southwest University, [email protected]

Abstract

Remarkable performance from Transformer networks in Natural Language Processing has promoted the development of these models for computer vision tasks such as image recognition and segmentation. In this paper, we introduce a novel framework, called Multi-level Multi-scale Point Transformer (MLMSPT), that works directly on irregular point clouds for representation learning. Specifically, a point pyramid transformer is investigated to model features at the diverse resolutions or scales we define, followed by a multi-level transformer module that aggregates contextual information from different levels of each scale and enhances their interactions, and a multi-scale transformer module designed to capture the dependencies among representations across different scales. Extensive evaluation on public benchmark datasets demonstrates the effectiveness and competitive performance of our method on 3D shape classification, part segmentation and semantic segmentation tasks.

1. Introduction

Recently, point cloud analysis has been drawing more and more attention, since point clouds, now a preferred representation for classification and segmentation tasks, provide much richer geometric as well as photometric information than 2D images. In particular, the outstanding success of deep learning strategies has brought remarkable advances in 3D point cloud analysis across a diverse range of applications, such as autonomous driving [8][9], robotics [25][40] and virtual/augmented reality [7][11]. However, effective and efficient feature learning from point clouds remains a challenging problem due to the irregular, unordered and sparse nature of point clouds.

To tackle these challenges, many state-of-the-art works focus on transforming the unstructured point cloud into either voxel grids [24] or multi-view images [33]. Although impressive progress has been made, these regular representations inevitably give rise to a loss of underlying geometric information during the transformation, as well as high computation cost and memory consumption. The appearance of point-wise methods, such as PointNet [26], has revolutionized point cloud learning. These approaches directly process the raw point clouds by adopting shared Multi-Layer Perceptrons (MLP) [27], defining convolutional kernels [36] or constructing graphs [17]. However, most existing approaches may not be effective enough to learn context-dependent representations for point clouds.

In this work, we propose a novel point-based transformer architecture, named Multi-level Multi-scale Transformer (MLMST), following the tremendous achievements of transformer models in Natural Language Processing (NLP) and 2D computer vision. Transformers provide an ideal strategy to model relationships between points, since they are permutation invariant and as highly expressive as convolution. As shown in Figure 1, our MLMST mainly consists of three carefully designed modules: (1) a point pyramid transformer (PPT), capturing context information from different resolutions or receptive fields; (2) a multi-level transformer (MLT), learning cross-level feature interactions to further aggregate geometric and semantic information; and (3) a multi-scale transformer (MST), modeling context interactions across different scales to improve the expressive capability. Based on these three core modules, our MLMST is able to better capture long-range contextual dependencies from different levels and scales in an end-to-end manner.

arXiv:2104.13636v1 [cs.CV] 28 Apr 2021


We evaluate the effectiveness and representation capability of our network on several public benchmark datasets (e.g., ModelNet [42], ShapeNet [47] and S3DIS [1]) for 3D point cloud classification and segmentation tasks. Extensive results show that MLMST achieves highly striking performance comparable to state-of-the-art methods.

In summary, the main contributions of our work are as follows:

• We design a novel point pyramid transformer, a multi-level transformer and a multi-scale transformer to capture cross-level and cross-scale context-aware feature interactions and improve the discriminative power of the learned representation.

• Based on these three modules, we construct an end-to-end network architecture, named Multi-level Multi-scale Point Transformer (MLMSPT), taking unstructured point clouds as input for highly effective geometric and semantic feature learning.

• Extensive experiments on challenging benchmarks demonstrate that our MLMSPT model presents state-of-the-art performance on the tasks of 3D object classification, part segmentation as well as semantic segmentation based on point clouds.

2. Related Work

2.1. Deep Learning on Point Clouds

Recently, increasing attention has been paid to designing deep neural networks for 3D point cloud learning, which achieve state-of-the-art performance in applications such as 3D object classification [6], part segmentation [21] and semantic segmentation [51]. In this section, we provide a brief review of existing approaches, which can be categorized into four groups:

Voxel based Methods [15] attempt to voxelize the unstructured point cloud into a regular volumetric grid structure, so that standard 3D convolutional neural networks can be applied directly, as for images, to learn descriptors. For example, VoxNet [24] is a milestone towards real 3D learning. However, these approaches have difficulty in capturing high-resolution or fine-grained features due to sparsity, loss of geometric information during rasterization, and expensive memory and computational consumption [25]. Different efforts have later been made to alleviate this problem. OctNet [30] represents the point clouds using a hybrid grid-octree structure which is applicable to high-resolution inputs of size 256×256×256, and the Kd-tree [13] is another structure that can be utilized to provide improved grid resolution.

Multi-view based Methods [33] project the raw 3D point clouds into a collection of 2D images rendered from different viewpoints, followed by image-wise feature extraction with well-designed 2D convolutional neural networks. These features are then fused to form the final representation for various analyses. Although remarkable performance is achieved, this kind of approach suffers from information loss during the projection process [48] and becomes time-consuming when dealing with sparse point clouds. On the other hand, it is difficult to determine the appropriate number of views for modeling the underlying geometric structure [49].

Point Cloud based Methods directly manipulate unstructured and unordered point clouds, taking the 3D coordinates and/or RGB and/or normals as initial input. As the pioneering work, the emergence of PointNet [26] can be considered a milestone in the domain of point cloud learning, and it guides the development of pointwise MLP methods. This series of works usually utilizes shared MLPs to process each point individually for feature extraction. However, their performance is limited since they fail to capture local spatial relationships in the data [48][45]. Recent approaches concentrate on defining effective convolution kernels for points. KPConv [36] defines point convolution using any number of kernel points with filter weights on each point, which gives more flexibility and is invariant to point order. FPConv [20] proposes a surface-style convolution for point cloud analysis by learning local flattening, which can be treated as complementary to volumetric-style convolution.

Graph based Methods lead a new trend of irregular data processing, representing the point cloud as a graph to model the local geometric information among points [35]. ECC [32] and DGCNN [41] propose different edge-dependent convolution operations to aggregate neighboring features spatially. GAC [37] defines the filter kernel using learned attentional weights assigned to neighboring points, which is helpful for semantic segmentation. 3D-GCN [21] introduces a well-designed deformable 3D kernel to guarantee scale invariance, and a 3D graph max-pooling operation to summarize cross-scale features. SPH3D-GCN [17] proposes a separable spherical convolutional kernel for graph neural networks and achieves highly competitive performance on the standard benchmarks.

2.2. Transformer in Computer Vision

Transformer networks can be perceived as a significant breakthrough in Natural Language Processing (NLP), whose success is mainly attributed to the self-attention mechanism, which can model long-range information and dependencies in the input data [12]. Recently, many architectures have begun to take transformers and self-attention into consideration for computer vision tasks. Image GPT [3] is the first work to investigate the Transformer for learning image representations in an unsupervised fashion. ViT [4]



Figure 1. The overall architecture of the Multi-level Multi-scale Point Transformer model for point cloud analysis. The network mainly contains three key components: a Point Pyramid Transformer encoding pointwise features at three different scales or resolutions, and a Multi-level Transformer and a Multi-scale Transformer modeling cross-level and cross-scale context dependencies. pFFN denotes point Feature Forward Network.

applies the original Transformer to image patches instead of pixels for the image classification task, achieving state-of-the-art performance with less computational resource consumption. DETR [2] performs object detection from a set-prediction point of view using a transformer encoder-decoder architecture. The advantage of DETR is that it does not require the hand-designed modules (e.g., non-maximum suppression) usually used in previous frameworks.

Inspired by the fundamental mechanism of the Transformer in NLP and computer vision, we aim to model cross-level and cross-scale feature dependencies with our well-designed transformer modules to obtain fine-grained performance on point cloud based tasks.

3. Point Cloud Representation with Transformer

As illustrated in Figure 1, the input is a point cloud of N points P ∈ R^{N×C}, where C is the dimension of the pointwise properties. We aim at modeling cross-level and cross-scale interactions to discriminatively boost the expressive capability of the learned representations. Farthest point sampling (FPS) is first performed to obtain three point clouds with different resolutions, followed by latent feature learning with a feature embedding module. Then, we extract hierarchical multi-scale representations via our Point Pyramid Transformer (PPT). For each path of the PPT module, a Multi-level Transformer (MLT) consumes the concatenation of features from different levels to capture cross-level representation correlations or interactions. Finally, through a Multi-scale Transformer (MST), we relate the point features across different resolutions to learn a discriminative representation.

3.1. Point Pyramid Transformer

Empirically, different resolutions correspond to different scales, i.e., different sizes of receptive field during feature extraction with the same operator. Therefore, in order to model hierarchical semantic or contextual information at different scales for a point cloud, we introduce a point pyramid transformer module. Farthest point sampling (FPS) operations are progressively performed on the input point cloud P to get three point clouds P1, P2, P3 with different resolutions N1 = N, N2 = N/2 and N3 = N/4, respectively, followed by generating an initial pointwise feature map pyramid P = {F^0_i ∈ R^{N_i×D}, i = 1, 2, 3} via a feature embedding block using a pointwise feedforward network.
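The sampling step can be sketched compactly. The following PyTorch snippet is a minimal illustration of farthest point sampling and of building the three-resolution pyramid by progressive halving; the paper releases no code, so the function and variable names here (farthest_point_sample, build_pyramid) are our own and the details are assumptions.

```python
import torch

def farthest_point_sample(xyz, n_samples):
    """Iteratively pick the point farthest from the already-selected set.
    xyz: (B, N, 3) coordinates; returns (B, n_samples) point indices."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_samples, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float('inf'), device=xyz.device)
    farthest = torch.randint(0, N, (B,), device=xyz.device)   # random seed point
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_samples):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)           # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))
        farthest = dist.argmax(dim=-1)                         # next farthest point
    return idx

def build_pyramid(points):
    """Progressively halve the resolution twice: N -> N/2 -> N/4.
    points: (B, N, C) with xyz in the first three channels."""
    pyramid = [points]
    for _ in range(2):
        prev = pyramid[-1]
        sel = farthest_point_sample(prev[..., :3], prev.shape[1] // 2)
        sel = sel.unsqueeze(-1).expand(-1, -1, prev.shape[-1])
        pyramid.append(torch.gather(prev, 1, sel))
    return pyramid   # [P1 (N points), P2 (N/2), P3 (N/4)]
```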

The Point Pyramid Transformer (PPT) module takes P as input, where each branch independently maps the feature maps of the corresponding scale into a latent representation F^i_PPT ∈ R^{N_i×D''}, i = 1, 2, 3. Since the self-attention mechanism at the core of Transformer models is permutation invariant and can model long-range context dependencies, the essential building block of PPT is the point self-attention (PSA) mechanism shown in Figure 2. Formally, we formulate the attention operator as follows:

F^l_i = A(F^l_i) = σ(Ψ(F^l_i) Φ(F^l_i)^T / √D) ∆(F^l_i) + F^l_i   (1)

where i = 1, 2, 3 denotes the i-th scale or resolution and l = 0, 1, 2, 3, 4 represents the l-th layer. Ψ(·), Φ(·) and ∆(·) are the linear point transformations used in our paper:

Ψ(F^l_i) = F^l_i W_q ∈ R^{N_i×D'}   (2)


Figure 2. Architecture of point self attention mechanism used in Point Pyramid Transformer.

Φ(F^l_i) = F^l_i W_k ∈ R^{N_i×D'}   (3)

∆(F^l_i) = F^l_i W_v ∈ R^{N_i×D}   (4)

W_q, W_k ∈ R^{D×D'} and W_v ∈ R^{D×D} are the learnable weight parameters, and σ is the softmax function that normalizes the weights.

Subsequently, from top to bottom in our PPT, each path is constructed by stacking four sequential PSA modules to obtain pointwise feature representations for a certain resolution of the point cloud. Finally, to fully exploit cross-level information, the feature maps of different levels, together with the initial input, are concatenated to generate the output of the PPT module, F^i_PPT:

F^i_PPT = concat(F^0_i, F^1_i, F^2_i, F^3_i, F^4_i), i = 1, 2, 3   (5)

Actually, we can modify the number of resolution or scale branches, the number of stacked PSA modules or levels, and the size of the input feature maps according to the specific application.
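For concreteness, the snippet below is our own PyTorch reading of one PSA layer (Eq. (1)-(4)) and one PPT branch with the level concatenation of Eq. (5); the class names and the choice of nn.Linear layers for Ψ, Φ and ∆ are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PointSelfAttention(nn.Module):
    """One PSA layer, Eq. (1)-(4): softmax(Psi(F) Phi(F)^T / sqrt(D)) Delta(F) + F."""
    def __init__(self, dim, dim_qk=None):
        super().__init__()
        dim_qk = dim_qk or dim
        self.psi = nn.Linear(dim, dim_qk, bias=False)    # W_q
        self.phi = nn.Linear(dim, dim_qk, bias=False)    # W_k
        self.delta = nn.Linear(dim, dim, bias=False)     # W_v
        self.scale = dim ** -0.5

    def forward(self, f):                                # f: (B, N_i, D)
        attn = torch.softmax(self.psi(f) @ self.phi(f).transpose(-2, -1) * self.scale, dim=-1)
        return attn @ self.delta(f) + f                  # residual connection

class PPTBranch(nn.Module):
    """One branch of the Point Pyramid Transformer: four stacked PSA layers,
    with the input and all intermediate levels concatenated as in Eq. (5)."""
    def __init__(self, dim, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(PointSelfAttention(dim) for _ in range(depth))

    def forward(self, f0):                               # f0: (B, N_i, D)
        levels = [f0]
        for layer in self.layers:
            levels.append(layer(levels[-1]))
        return torch.cat(levels, dim=-1)                 # F_PPT^i: (B, N_i, 5D)
```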

3.2. Multi-level Transformer

Theoretically, aggregating information contained in different levels can boost the expressive power of pointwise features [50]. In order to take full advantage of multi-level context and model the long-range dependencies or interactions across these levels, we introduce a Multi-level Transformer (MLT) module based on the multi-head self-attention mechanism, consisting of three independent parallel MLT blocks, one per scale or resolution path.

Our MLT operator consumes F^i_PPT to encode much richer relationships amongst points, and is formally defined as:

F^i_MLT = MA(F^i_PPT) = concat(A_1, A_2, ..., A_M) + F^i_PPT   (6)

where

A_m(F^i_PPT) = σ(Q^i_m (K^i_m)^T / √(D''/M)) V^i_m   (7)

Q^i_m = F^i_PPT W^i_Qm   (8)

K^i_m = F^i_PPT W^i_Km   (9)

V^i_m = F^i_PPT W^i_Vm   (10)

Here, M is the number of heads and m indicates the m-th head. W^i_Qm ∈ R^{D''×d_q}, W^i_Km ∈ R^{D''×d_k} and W^i_Vm ∈ R^{D''×d_v} are learnable weight matrices, and we set d_q = d_k = d_v = D''/M.

Lastly, these three individual MLT blocks map the level-concatenated features of the three resolutions from PPT into three independent shape-semantic aware representations {F^i_MLT, i = 1, 2, 3}.
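To make the head bookkeeping of Eq. (6)-(10) concrete, here is a minimal multi-head self-attention block in PyTorch written the way we read the paper (one block per resolution branch, head dimension D''/M, residual connection back to the input); it is an illustrative sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiLevelTransformer(nn.Module):
    """Multi-head self-attention over level-concatenated features, Eq. (6)-(10)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0, "D'' must be divisible by the number of heads M"
        self.heads, self.d = num_heads, dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)   # stacks W_Qm over all heads
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, f):                             # f: (B, N_i, D'')
        B, N, _ = f.shape
        split = lambda t: t.view(B, N, self.heads, self.d).transpose(1, 2)  # (B, M, N, d)
        q, k, v = split(self.to_q(f)), split(self.to_k(f)), split(self.to_v(f))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # concat(A_1, ..., A_M)
        return out + f                                       # residual, Eq. (6)
```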

3.3. Multi-scale Transformer

Generally, point features from different scales or resolutions correspond to different contextual or semantic information [10]. To enhance the interaction among low-, mid- and high-resolution features, we also adopt the multi-head self-attention mechanism to construct our Multi-scale Transformer (MST) module and generate relations across different scales.

The cross-level interacted feature maps F^i_MLT obtained from MLT are fed into our MST module. We first upsample the maps of the three different scales to the same size as the original input of our network via the interpolation operation used in PointNet++ [27]. Then we concatenate them together to encode multi-scale information:

F^i_up = Up(F^i_MLT), i = 1, 2, 3   (11)

F_cat = concat(F^1_up, F^2_up, F^3_up)   (12)


Subsequently, the multi-scale transformer operation is performed on F_cat. With a multi-head self-attention strategy similar to Eq. (6), the MST module is formulated as:

F_MST = MA(F_cat)   (13)

By integrating the Multi-scale Transformer module, we can further boost the ability of our network to learn a discriminative representation for each point with semantically and geometrically enhanced information.
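The whole MST step (Eqs. (11)-(13)) then amounts to upsampling each branch back to N points, concatenating along the channel axis, and running one more multi-head self-attention block. The sketch below assumes the MultiLevelTransformer class from the previous snippet is in scope and uses a simple inverse-distance three-nearest-neighbor interpolation in the spirit of PointNet++ feature propagation; it is an illustration under those assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

def upsample_to(feat_src, xyz_src, xyz_dst, k=3):
    """Inverse-distance-weighted k-NN interpolation of per-point features.
    feat_src: (B, M, D), xyz_src: (B, M, 3), xyz_dst: (B, N, 3) -> (B, N, D)."""
    d, idx = torch.cdist(xyz_dst, xyz_src).topk(k, dim=-1, largest=False)   # (B, N, k)
    w = 1.0 / (d + 1e-8)
    w = w / w.sum(dim=-1, keepdim=True)
    gathered = torch.gather(
        feat_src.unsqueeze(1).expand(-1, xyz_dst.shape[1], -1, -1), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, feat_src.shape[-1]))           # (B, N, k, D)
    return (w.unsqueeze(-1) * gathered).sum(dim=2)

class MultiScaleTransformer(nn.Module):
    """Eq. (11)-(13): upsample all branches to N points, concatenate, attend."""
    def __init__(self, dim_per_scale, num_heads=4):
        super().__init__()
        self.attn = MultiLevelTransformer(3 * dim_per_scale, num_heads)

    def forward(self, feats, coords, coords_full):
        # feats/coords: per-branch feature maps and point coordinates (3 entries each)
        up = [upsample_to(f, c, coords_full) for f, c in zip(feats, coords)]
        return self.attn(torch.cat(up, dim=-1))        # F_MST: (B, N, 3 * D'')
```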

4. Experiments

In order to evaluate the performance of our proposed Multi-level Multi-scale Transformer model, we conduct extensive experiments on 3D point cloud classification, part segmentation and semantic segmentation, and provide comparisons with state-of-the-art approaches. We implement our Transformer architecture in PyTorch on an NVIDIA Titan RTX with 24 GB of memory. The Adam optimizer and a step LR decay scheduler are used to train all models. The following expands on the experiments and results.
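For orientation, the training loop is standard; the snippet below only shows the stated optimizer and scheduler choices in PyTorch. The model class `MLMST`, the data loader, the scheduler step size and the decay factor are placeholders of our own, since the paper names only Adam and a step LR scheduler.

```python
import torch
import torch.nn.functional as F

model = MLMST()                                   # hypothetical model class
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Step decay: multiply the learning rate by gamma every step_size epochs
# (step_size and gamma are assumed values, not reported in the paper).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.7)

num_epochs = 250                                  # classification setting, Sec. 4.1.1
for epoch in range(num_epochs):
    for points, labels in train_loader:           # hypothetical DataLoader
        optimizer.zero_grad()
        loss = F.cross_entropy(model(points), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```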

4.1. Point Cloud Classification

4.1.1 Datasets

The classification task is performed on the ModelNet10 and ModelNet40 benchmark datasets. ModelNet10 contains 2,468/909 training/testing models in 10 categories, while ModelNet40 consists of 12,311 CAD models from 40 categories, in which 9,843 instances are selected for training and 2,468 shapes are used for testing. Following PointNet [26], 1,024 points are uniformly sampled from each object model. During training, we perform random point dropout, random scaling in [0.8, 1.25] and random shifting in [-0.1, 0.1] on the input point clouds for data augmentation. We train the classification network for 250 epochs using an initial learning rate of 0.0003. The batch size is set to 32.
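A sketch of this augmentation, written for completeness: the paper does not specify how dropped points are handled, so replacing them with the first point of each cloud, as in common PointNet-style pipelines, is our assumption, as is the maximum dropout ratio.

```python
import torch

def augment(points):
    """Random point dropout, random scaling in [0.8, 1.25] and random
    shifting in [-0.1, 0.1], applied per cloud. points: (B, N, 3)."""
    B, N, _ = points.shape
    # Random point dropout: each cloud draws its own dropout ratio (<= 0.875 assumed);
    # dropped points are replaced by the cloud's first point.
    ratio = torch.rand(B, 1, device=points.device) * 0.875
    drop = torch.rand(B, N, device=points.device) < ratio            # True = drop
    points = torch.where(drop.unsqueeze(-1), points[:, :1, :], points)
    # Random scaling and shifting, one draw per cloud.
    scale = torch.empty(B, 1, 1, device=points.device).uniform_(0.8, 1.25)
    shift = torch.empty(B, 1, 3, device=points.device).uniform_(-0.1, 0.1)
    return points * scale + shift
```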

4.1.2 Performance Comparison

Table 1 quantitatively compares our method with several state-of-the-art methods. Our MLMST operates directly on the raw xyz coordinates of only 1,024 points to produce these results. On ModelNet10, our Transformer network obtains a comparable overall accuracy of 95.5%, the second best result, while on ModelNet40 our model achieves the best performance with an accuracy of 92.9%, outperforming the voxel-input, multi-view-input and point-input methods. These competitive results convincingly demonstrate the effectiveness of our MLMST.

4.2. Part Segmentation

Part segmentation is a challenging task that aims to assign a part label to each point in a given 3D point cloud object.

4.2.1 Datasets

We evaluate the part segmentation task on the widely used ShapeNet part dataset [47], which covers 16,881 3D point cloud objects from 16 categories. The objects in each category are labeled with fewer than six parts, amounting to 50 parts in total, where each point is associated with a part label. The models are split into 14,007 examples for training and 2,874 for testing. We sample 2,048 points from each shape. During training, we adopt the same data augmentation strategy as for classification. We train our segmentation model for 180 epochs with a mini-batch size of 8.

4.2.2 Evaluation metric

The Intersection-over-Union (IoU) on points is used as the metric for quantitatively evaluating the segmentation results of our model and comparing with other existing methods. Following previous works [26], we define the IoU of each category as the average of the IoUs of all the shapes belonging to that category. In addition, the overall mean IoU (mIoU) is calculated by averaging the IoUs across all shape instances.
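Concretely, the per-shape computation can be sketched as follows (a minimal NumPy illustration of the PointNet-style convention; part_ids would hold the part labels valid for the shape's category, and a part absent from both prediction and ground truth counts as IoU 1):

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """pred, gt: (N,) per-point part labels of one shape; returns the shape's mean IoU."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)   # empty part counts as 1
    return float(np.mean(ious))

# Category IoU: average of shape_iou over the shapes of one category.
# Instance mIoU: average of shape_iou over all test shapes.
```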

4.2.3 Results

Table 2 summarizes the performance comparison between our MLMST and several baselines. From the quantitative results, it can be clearly seen that our Transformer model reaches a much better part segmentation performance with an instance mIoU of 86.4%, outperforming the state-of-the-art approaches RS-CNN and ELM by 0.2% and 1.2%, respectively. A visualization of part segmentation on the ShapeNet part dataset is given in Figure 3. These results show the robustness of our MLMST to diverse shapes.

4.3. Semantic Segmentation in Scenes

4.3.1 Dataset

Here, we use the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) [1] to evaluate our model on semantic scene segmentation. This benchmark provides 3D point clouds collected by Matterport scanners in 6 indoor areas containing 271 rooms from three different buildings, where each point is associated with one of 13 semantic labels (e.g., chair and ceiling). Following the schema of [37][34] for better evaluating the model's generalizability, we single out Area 5 (which is in a building different from the others) as our test set, and the remaining areas are used to train our model. During


Table 1. Comparisons of recognition accuracy (%) on ModelNet10 and ModelNet40. The best results are shown in bold.

Method               Representation   Input Size   ModelNet10   ModelNet40
3DShapeNets [42]     Volumetric       30^3         83.5%        77.3%
VoxNet [24]          Volumetric       32^3         92.0%        83.0%
OctNet [30]          Volumetric       128^3        90.9%        86.5%
MVCNN [33]           Multi-view       12 × 224^2   -            90.1%
DeepNet [29]         Points           5000 × 3     -            90.0%
Kd-Net [13]          Points           2^15 × 3     93.5%        88.5%
PointNet [26]        Points           1024 × 3     -            89.2%
PointNet++ [27]      Points+normals   5000 × 6     -            91.9%
ECC [32]             Points           1000 × 3     90.0%        83.2%
DGCNN [41]           Points           1024 × 3     -            92.2%
PointCNN [19]        Points           1024 × 3     -            92.5%
KC-Net [31]          Points           1024 × 3     94.4%        91.0%
FoldingNet [46]      Points           2048 × 3     94.4%        88.4%
Point2Sequence [22]  Points           1024 × 3     95.3%        92.6%
OctreeGCNN [16]      Points           -            94.6%        92.0%
KPConv [36]          Points           6800 × 3     -            92.9%
SFCNN [28]           Points+normals   1024 × 6     -            92.3%
3D-GCN [21]          Points           1024 × 3     -            92.1%
ELM [6]              Points           1024 × 3     95.7%        92.2%
FPConv [20]          Points+normals   -            -            92.5%
SPH3D-GCN [17]       Points           1000 × 3     -            92.1%
MLMST                Points           1024 × 3     95.5%        92.9%

Table 2. Experimental comparison of part segmentation with the state-of-the-art approaches on the ShapeNet part dataset. The mean IoU across all shape instances and the IoU for each category are reported.

Method            mIoU  aero  bag   cap   car   chair ep    guitar knife lamp  laptop motor mug   pistol rocket skate table
ShapeNet [47]     81.4  81.0  78.4  77.7  75.7  87.6  61.9  92.0   85.4  82.5  95.7   70.6  91.9  85.9   53.1   69.8  75.3
PointNet [26]     83.7  83.4  78.7  82.5  74.9  89.6  73.0  91.5   85.9  80.8  95.3   65.2  93.0  81.2   57.9   72.8  80.6
PointNet++ [27]   85.1  82.4  79.0  87.7  77.3  90.8  71.8  91.0   85.9  83.7  95.3   71.6  94.1  81.3   58.7   76.4  82.6
KD-Net [13]       82.3  80.1  74.6  74.3  70.3  88.6  73.5  90.2   87.2  71.0  94.9   57.4  86.7  78.1   51.8   69.9  80.3
SO-Net [18]       84.9  82.8  77.8  88.0  77.3  90.6  73.5  90.7   83.9  82.8  94.8   69.1  94.2  80.9   53.1   72.9  83.0
RGCNN [35]        84.3  80.2  82.8  92.6  75.3  89.2  73.7  91.3   88.4  83.3  96.0   63.9  95.7  60.9   44.6   72.9  80.4
DGCNN [41]        85.2  84.0  83.4  86.7  77.8  90.6  74.7  91.2   87.5  82.8  95.7   66.3  94.9  81.1   63.5   74.5  82.6
SRN [5]           85.3  82.4  79.8  88.1  77.9  90.7  69.6  90.9   86.3  84.0  95.4   72.2  94.9  81.3   62.1   75.9  83.2
SFCNN [28]        85.4  83.0  83.4  87.0  80.2  90.1  75.9  91.1   86.2  84.2  96.7   69.5  94.8  82.5   59.9   75.1  82.9
RS-CNN [23]       86.2  83.5  84.8  88.8  79.6  91.2  81.1  91.6   88.4  86.0  96.0   73.7  94.1  83.4   60.5   77.7  83.6
3D-GCN [21]       85.1  83.1  84.0  86.6  77.5  90.3  74.1  90.9   86.4  83.8  95.6   66.8  94.8  81.3   59.6   75.7  82.6
ELM [6]           85.2  84.0  80.4  88.0  80.2  90.7  77.5  91.2   86.4  82.6  95.5   70.0  93.9  84.1   55.6   75.6  82.1
SOCNN [49]        85.7  83.9  84.1  85.0  77.4  91.3  78.3  91.7   87.4  83.8  96.4   69.7  93.5  83.1   58.9   76.2  82.9
Weak Sup. [44]    85.0  83.1  82.6  80.8  77.7  90.4  77.3  90.9   87.6  82.9  95.8   64.7  93.9  79.8   61.9   74.9  82.9
MLMST             86.4  84.4  84.7  89.2  80.2  89.4  77.1  92.3   87.5  85.3  96.7   71.6  95.2  84.2   61.3   76.0  83.6

training, we use 4,096 points sampled from the dataset as input to train our Transformer model. We train for 200 epochs and set the batch size to 6.

4.3.2 Results

Table 3 presents the quantitative evaluation of our MLMST, where we also make a fair comparison with several state-of-the-art methods. From the table, our MLMST architecture achieves competitive performance with 62.9% mIoU on the Area 5 evaluation, 21.8% higher than PointNet. In particular, our model obtains leading results on the floor, wall, chair and board classes. We additionally provide a qualitative comparison between semantic segmentation results and ground truth in Figure 4. These quantitative and qualitative results further validate the effectiveness and promise of our MLMST.

4.4. Ablation Study

In this section, we perform an extensive ablation study to investigate the effectiveness of each individual component of our MLMST architecture, using ModelNet10 for evaluation. Specifically, we adopt a single-resolution MLP as our baseline. Table 4 summarises the classification accuracy of


Figure 3. Visualization of part segmentation on ShapeNet Part.

Table 3. Experimental comparison of semantic segmentation with the state-of-the-art approaches on S3DIS.

Method                mIoU  ceiling floor wall  beam  column window door  table chair sofa  bookcase board clutter
SegCloud [34]         48.9  90.1    96.1  69.9  0.0   18.4   38.4   23.1  70.4  75.9  40.9  58.4     13.0  41.6
PCCN [38]             58.3  92.3    96.2  75.9  0.3   6.0    69.3   63.5  66.9  65.6  47.3  68.9     59.1  46.2
ShapeContextNet [43]  52.7  -       -     -     -     -      -      -     -     -     -     -        -     -
PointNet [26]         41.1  88.8    97.3  69.8  0.1   3.9    46.3   10.8  58.9  52.6  5.9   40.3     26.4  33.2
PointCNN [19]         57.3  92.3    98.2  79.4  0.0   17.6   22.8   62.1  74.4  80.6  31.7  66.7     62.1  56.7
SGPN [39]             54.4  79.4    66.3  88.8  78.0  60.7   66.6   56.8  46.9  40.8  6.4   47.6     11.1  -
SPGraph [14]          58.0  89.4    96.9  78.1  0.0   42.8   48.9   61.6  84.7  75.4  69.8  52.6     2.1   52.2
DGCNN [41]            56.1  -       -     -     -     -      -      -     -     -     -     -        -     -
PointWeb [52]         60.3  92.0    98.5  79.4  0.0   21.1   59.7   34.8  76.3  88.3  46.9  69.3     64.9  52.5
FPConv [20]           62.8  94.6    98.5  80.9  0.0   19.1   60.1   48.9  80.6  88.0  53.2  68.4     68.2  54.9
Weak Sup. [44]        48.0  90.9    97.3  74.8  0.0   8.4    49.3   27.3  69.0  71.7  16.5  53.2     23.3  42.8
SPH3D-GCN [17]        59.5  93.3    97.1  81.1  0.0   33.2   45.8   43.8  79.7  86.9  33.2  71.5     54.1  53.7
MLMST                 62.9  94.5    98.7  90.6  0.0   21.1   60.0   51.4  83.0  89.6  28.9  70.7     74.2  55.5

Table 4. Ablation analysis of our Multi-level Multi-scale Transformer architecture.

Method                  Accuracy
Baseline                86.0%
Baseline + PPT          93.1%
Baseline + PPT + MLT    92.3%
Baseline + PPT + MST    94.6%
MLMST                   95.5%

different design choices. From these results, we can claim that integrating the PPT, MLT and MST modules achieves a significant performance improvement over the baseline. This further demonstrates that feature interactions across different levels and scales are beneficial to discriminative point cloud representation learning.

4.5. Conclusion

In this paper, we introduced the Multi-level Multi-scale Point Transformer, an end-to-end architecture relying on the self-attention mechanism for point cloud analysis, which integrates three fundamental building modules, a point pyramid transformer, a multi-level transformer and a multi-scale transformer, to enrich contextual interaction across different levels and scales. Extensive experiments conducted on challenging benchmarks have demonstrated that our MLMSPT achieves state-of-the-art performance on 3D object classification and segmentation. We believe that the Transformer can play an important role in learning point cloud representations, and further investigation of its development and application to various point cloud based tasks should be explored in the future.

References

[1] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Computer Vision and Pattern Recognition, 2016.

[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.

[3] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.

[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.


Figure 4. Visualization of point cloud semantic segmentation on S3DIS dataset.

[5] Yueqi Duan, Yu Zheng, Jiwen Lu, Jie Zhou, and Qi Tian. Structural relational reasoning of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 949–958, 2019.

[6] Kent Fujiwara and Taiichi Hashimoto. Neural implicit embedding for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11734–11743, 2020.

[7] Zan Gojcic, Caifa Zhou, Jan D Wegner, Leonidas J Guibas, and Tolga Birdal. Learning multiview 3d point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1759–1769, 2020.

[8] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[9] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11108–11117, 2020.

[10] Zitian Huang, Yikuan Yu, Jiawen Xu, Feng Ni, and Xinyi Le. Pf-net: Point fractal network for 3d point cloud completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7662–7670, 2020.

[11] Haiyong Jiang, Feilong Yan, Jianfei Cai, Jianmin Zheng, and Jun Xiao. End-to-end 3d point cloud instance segmentation without detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12796–12805, 2020.

[12] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.

[13] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.

[14] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018.

[15] Truc Le and Ye Duan. Pointgrid: A deep network for 3d shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9204–9214, 2018.

[16] Huan Lei, Naveed Akhtar, and Ajmal Mian. Octree guided cnn with spherical kernels for 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[17] Huan Lei, Naveed Akhtar, and Ajmal Mian. Spherical kernel for efficient graph convolution on 3d point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[18] Jiaxin Li, Ben M Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9397–9406, 2018.

[19] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.

[20] Yiqun Lin, Zizheng Yan, Haibin Huang, Dong Du, Ligang Liu, Shuguang Cui, and Xiaoguang Han. Fpconv: Learning local flattening for point convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2020.

[21] Zhi-Hao Lin, Sheng-Yu Huang, and Yu-Chiang Frank Wang. Convolution in the cloud: Learning deformable kernels in 3d graph convolution networks for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1800–1809, 2020.

[22] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In AAAI, 2019.

[23] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[24] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.

[25] Ehsan Nezhadarya, Ehsan Taghavi, Ryan Razani, Bingbing Liu, and Jun Luo. Adaptive hierarchical down-sampling for point cloud classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12956–12964, 2020.

[26] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[27] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

[28] Yongming Rao, Jiwen Lu, and Jie Zhou. Spherical fractal convolutional neural networks for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 452–460, 2019.

[29] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.

[30] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.

[31] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4548–4557, 2018.

[32] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3693–3702, 2017.

[33] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[34] Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In 2017 International Conference on 3D Vision (3DV), pages 537–547. IEEE, 2017.

[35] Gusi Te, Wei Hu, Amin Zheng, and Zongming Guo. Rgcnn: Regularized graph cnn for point cloud segmentation. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 746–754. ACM, 2018.

[36] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6411–6420, 2019.

[37] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10296–10305, 2019.

[38] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2589–2597, 2018.

[39] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2569–2578, 2018.

[40] Yue Wang and Justin M Solomon. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3523–3532, 2019.

[41] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.

[42] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[43] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2018.

[44] Xun Xu and Gim Hee Lee. Weakly supervised semantic point cloud segmentation: Towards 10x fewer labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13706–13715, 2020.

[45] JuYoung Yang, Chanho Lee, Pyunghwan Ahn, Haeil Lee, Eojindl Yi, and Junmo Kim. Pbp-net: Point projection and back-projection network for 3d point cloud segmentation. arXiv preprint arXiv:2011.00988, 2020.

[46] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215, 2018.

[47] Li Yi, Vladimir G Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, Leonidas Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.

[48] Yang You, Yujing Lou, Qi Liu, Yu-Wing Tai, Lizhuang Ma, Cewu Lu, and Weiming Wang. Pointwise rotation-invariant network with adaptive sampling and 3d spherical voxel convolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12717–12724, 2020.

[49] Chaoyi Zhang, Yang Song, Lina Yao, and Weidong Cai. Shape-oriented convolution neural network for point cloud analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12773–12780, 2020.

[50] Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, and Qianru Sun. Feature pyramid transformer. In European Conference on Computer Vision, pages 323–339. Springer, 2020.

[51] Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, and Kai Xu. Fusion-aware point convolution for online semantic 3d scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4534–4543, 2020.

[52] Hengshuang Zhao, Li Jiang, Chi-Wing Fu, and Jiaya Jia. Pointweb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5565–5573, 2019.