
LaplacianNet: Learning on 3D Meshes with Laplacian Encoding and Pooling

Yi-Ling Qiao, Lin Gao, Jie Yang, Paul L. Rosin, Yu-Kun Lai, and Xilin Chen

Abstract—3D models are commonly used in computer vision and graphics. With the wider availability of mesh data, an efficient and intrinsic deep learning approach to processing 3D meshes is in great need. Unlike images, 3D meshes have irregular connectivity, requiring careful design to capture relations in the data. To utilize the topology information while staying robust under different triangulation, we propose to encode mesh connectivity using Laplacian spectral analysis, along with Mesh Pooling Blocks (MPBs) that can split the surface domain into local pooling patches and aggregate global information among them. We build a mesh hierarchy from fine to coarse using Laplacian spectral clustering, which is flexible under isometric transformation. Inside the MPBs there are pooling layers to collect local information and multi-layer perceptrons to compute vertex features with increasing complexity. To obtain the relationships among different clusters, we introduce a Correlation Net to compute a correlation matrix, which can aggregate the features globally by matrix multiplication with cluster features. Our network architecture is flexible enough to be used on meshes with different numbers of vertices. We conduct several experiments including shape segmentation and classification, and our LaplacianNet outperforms state-of-the-art algorithms for these tasks on ShapeNet and COSEG datasets.

Index Terms—Mesh Processing, Segmentation, Laplacian, Deep Learning


1 INTRODUCTION

Analyzing high-quality 3D models is of great significance in computer vision and graphics. Better understanding of 3D shapes would benefit many tasks such as segmentation, classification and shape analysis. Recently, deep learning methods have been prevalent in 2D image processing tasks such as image classification [1], [2] and semantic segmentation [3], [4]. With the help of large-scale image datasets and improved computational resources, deep learning methods boost the performance of image processing algorithms by a large margin. Inspired by this success on images, researchers have also applied learning algorithms to 3D data.

Recently, large-scale 3D datasets have made it possible to train neural networks for 3D shapes. Nevertheless, it is not a simple extension to apply neural networks in the 3D space. There are various 3D representations. The majority of 3D representations, such as meshes and point clouds, are non-canonical, requiring special design to put them through neural networks. To address this, some approaches are trained on ModelNet [5] and deal with voxels, but the resolution of voxel data is limited due to the curse of dimensionality. Alternatively, point clouds representing an object by a set of unstructured points with their xyz coordinates are commonly used. However, point clouds do not carry connectivity information and therefore are less efficient than meshes to represent shapes, and may have ambiguities when two surfaces are close. Another representation, the 3D mesh, is a fundamental data structure in computer graphics and vision, which not only encodes geometry but also topology and therefore has better descriptive power than the point cloud. A mesh is a graph with vertices, edges and faces that characterize the surface of a shape. For deep learning methods, mesh data is more compact but irregular when compared to voxels, making simple operations in the image domain such as convolutional kernels highly non-trivial. It also contains richer structure than a point cloud, which can be exploited when learning on 3D meshes. This paper aims to propose a flexible network structure that can perceive connectivity information while staying robust under different triangulation.

• Y.-L. Qiao, L. Gao and J. Yang are with the Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.

• P. L. Rosin and Y.-K. Lai are with the School of Computer Science and Informatics, Cardiff University, Wales, UK.

• X. Chen is with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.

To learn on 3D meshes, we propose the Laplacian Encoding and Pooling Network (LaplacianNet), which takes raw features of mesh models as input and outputs a function defined on the vertices. Inspired by image processing networks, we observe an intuitive principle that vertex features should be computed independently and associatively in different parts of the network. We therefore extend this principle to the non-Euclidean space of 2-manifolds. In our design, shown in Figure 1, the basic network structure involves consecutive Mesh Pooling Blocks (MPBs). Each block can split the surface into patches, like super-pixels in the image domain, by Laplacian spectral clustering. After splitting, the MPB can simultaneously compute features of individual vertices and clusters. To account for the relationships between clusters, we use a Correlation Net to compute a matrix that can fuse the information globally. Compared to images, a major difficulty for learning on meshes is that the vertices are unordered, and so are the clusters. For this reason, a fully-connected layer cannot work out the correlation matrix effectively. Therefore, we disentangle the correlation by independently mapping the clusters into a vector space. Then, the correlation between a pair of clusters is determined by the inner product of their corresponding vectors.

arXiv:1910.14063v1 [cs.GR] 30 Oct 2019

Before our LaplacianNet, efforts have been made to learn on 3D meshes. Please refer to [6] for the history and frontier research of deep learning on geometric data. Some works exploit geodesic distances on shapes [7], [8]. Based on this metric they design a spatial convolutional operator. Geodesic distance is invariant under isometric transformation, making their networks more robust. Whereas they compute the distance directly, our network utilizes the geodesic distance in an implicit way. The clustering is performed in the Laplacian feature domain, where Euclidean distance can approximate geodesic distance on the manifold [9]. As a result, the pooling can stay robust under isometric deformation and different triangulation (shown in Figure 3). Others choose to work in the spectral domain [10], [11], by defining convolutional kernels in the Fourier domain. However, the dependency of coefficients on the domain basis makes it difficult to share weights across shapes. Consequently, works like [12] have modules purposefully designed to synchronize the basis of domains. Our network does not suffer from changing domains, and we also propose a flexible structure, the Correlation Net, to address the alignment of clusters across models. Compared to all the aforementioned methods, we use both spectral and spatial information, such that our network can utilize the connectivity of meshes while staying robust under different domains with inconsistent triangulation.

The pipeline of our approach is shown in Figure 1, which learns cross-domain mesh functions using both spatial and spectral information. Overall, LaplacianNet has the following three modules: 1) the preprocessing step computes vertex features and clusters from the raw mesh data; 2) the Mesh Pooling Block (MPB) calculates local features within local regional clusters and collects global information through the Correlation Net; 3) the last part of the network depends on the specific application, e.g., it may output segmentation masks or classification labels.

In the experiments, we first evaluate the importance of mesh pooling blocks and the choice of the input features. We then verify that a single network can deal well with different mesh models with different numbers of vertices. To test the capability of our network, we train it on the ShapeNet and COSEG datasets to perform classification and segmentation tasks, which are fundamental shape understanding tasks in computer vision, and show superior overall performance.

The main contributions of our method are as follows:

• We propose the Laplacian Encoding and Pooling Network (LaplacianNet), a general network for learning on 3D meshes, which can utilize the connectivity of meshes while staying robust under different triangulation.

• We propose a flexible pooling strategy that can split model surfaces into clusters, like superpixels in images. By varying the clusters from fine to coarse, the network can process meshes hierarchically.

• We introduce a Correlation Net to compute the relationship among clusters. The computation process circumvents the randomness of cluster ordering, enabling consistency across domains.

2 RELATED WORK

We first summarize deep learning methods on 3D representations, and then provide a brief introduction to alternative input shape features. Finally, we review recent methods for mesh segmentation, which is a fundamental task when analyzing shapes.

3D Deep Learning. With the increasing availability of 3D models, deep learning methods for analyzing 3D data structures have been widely studied. There are several representations for 3D shapes, including voxels, point clouds and meshes. The voxel representation is similar to pixels in the 3D space, which can utilize a direct extension of 2D convolutional networks [13]. The point cloud is intensively researched, with works like [14] learning the transformation matrix of points and obtaining decent results on multiple datasets. Qi et al. [15] further propose PointNet++, which adds pooling operations where pooling areas are selected by nearest neighbors. These methods address problems on point clouds, whereas our method focuses on meshes. To better understand 3D meshes, the topology information is deeply exploited in our method. First, we use topology to perform clustering. In contrast, the nearest neighbor pooling strategy of PointNet++ cannot perceive the surface structure of meshes. Second, we use features that contain topology information as input. Meanwhile, we believe that it is not appropriate to carry the entire topology during the whole process. The connectivity information is irregular, large, hard to compute, and sensitive to noise. We instead condense the topology information by encoding the connectivity into the pooling areas and input features.

For the mesh representation, spatial methods [7], [8] define convolutional kernels on surfaces. Our LaplacianNet also utilizes spatial information, as the pooling strategy partitions the surface into patches, acting like super-pixels in image processing. For spectral methods, Bruna et al. [10] introduce a spectral convolutional layer on graphs, which can be interpreted as convolutions in the Fourier domain. Henaff et al. [16] handle large-scale classification problems. They propose to exploit the underlying relationships among individual data points by constructing a graph of the dataset and then solving a graph-based classification. When processing the graph, they introduce a CNN structure with spectral pooling. Compared with ours, their work is interested in each node and the pooling cluster does not contain geometric information. A fundamental problem of spectral convolution is generalizing across domains since the coefficients of spectral filters are basis dependent [6]. To address this problem, Yi et al. [12] further propose SyncSpecCNN to perform convolutional operations in a synchronized spectral domain, where they train a Spectral Transformer Network to align the functions in different domains. Compared to all the spectral CNN methods above, our LaplacianNet takes advantage of Laplacian spectral analysis to encode mesh topology and identify a spatial pooling strategy but avoids suffering from its dependency on the domain. We also show in the Experiments section that our method is robust under different object categories, numbers of vertices, and triangulation.

Fig. 1. Overview of our Laplacian Pooling Network (LaplacianNet). Given a 3D mesh as input with the aim of producing some kind of vertex function (depending on the application) as output, the pipeline of our network has three main components. First we preprocess the mesh to compute Laplacian eigenvectors and spectral clustering. These along with vertex coordinates and normals form the input feature to the network. Second, we stack several Mesh Pooling Blocks (MPBs) to analyze the shape model under multiple resolutions. An MPB includes some multi-layer perceptrons (MLP) to compute features independently for each vertex, and also a pooling layer to compute the features for clusters. A Correlation Net learns a correlation matrix C to fuse features across clusters. Then the fused features are concatenated with MLP results, and then associated with individual vertices in the cluster. After a sequence of MPBs (two in this illustration), the features are fed into the Application Network to produce output according to specific applications.

Shape Features. Over the years, many shape features have been developed to describe shape characteristics, including curvatures, geodesic shape contexts, geodesic distance features as used in [17], Heat Kernel Signatures [18], Wave Kernel Signatures [19], etc. Our LaplacianNet exploits mesh Laplacian spectral analysis, which provides an effective encoding of mesh topology. Laplacian eigenvectors also help us to cluster vertices for pooling layers, and they are an intrinsic feature describing the geometry of shapes. Readers can refer to [20] for details about the computation and applications of the graph Laplacian. However, two essential problems with Laplacian eigenvectors are that the sign is not well-defined, and perturbation occasionally occurs in high frequency terms. To eliminate these ambiguities, we use the absolute value of low frequency terms as input.

Mesh Segmentation. Mesh segmentation has long been a fundamental task in the field of computer vision and graphics. There are unsupervised and supervised methods to perform this task. For unsupervised methods, recent work usually uses the correspondence or correlation between shape units to co-segment a collection of objects in the same category [21], [22], [23], [24]. Those methods essentially analyze a whole dataset of similar 3D shapes and cluster shape parts that can be consistently segmented into one class. Other works try to take advantage of labeled data to develop a supervised method. Thanks to recent shape segmentation datasets [25], [26], [27], [28], supervised methods obtain higher accuracy than unsupervised ones. Among all those datasets, COSEG [27] and ShapeNet [26] have sufficiently many samples to train a network with reasonable generalizability, so we conduct experiments on the two datasets. Previous deep learning methods usually design different architectures to do segmentation. For example, George et al. [29] design a multi-branch 1D convolutional network and Wang et al. [30] put convolutional kernels on neighboring points and a pooling layer on the coarsened mesh. A 2D CNN is embedded into a 3D surface network by a projection layer in [31]. Guo et al. [32] concatenate different feature vectors into a rectangular block and apply CNNs to this image-like domain. Our method aims to develop a general network for learning on 3D meshes, and we demonstrate that our general approach outperforms existing methods in most cases.

3 METHODOLOGY

3.1 Problem Statement

Our proposed network is a general method for 3D meshes, capable of dealing with different numbers of vertices. Suppose that the current input $G = (V, E)$ is a mesh with $N$ vertices. There is an input feature function $f$ defined on vertices $V$, i.e., $f : V \to \mathbb{R}^{N \times c}$ where $c$ is the dimension of input features, typically containing coordinates, normals, mesh Laplacian, curvatures, etc. At the same time, there is a target function $g$ we aim to produce. It can usually be a vector function such as a segmentation $g : V \to C_s^N$ where each entry corresponds to the label of a vertex, or a single value like a classification category for the whole object $g : G \to C_l$, where $C_s$ and $C_l$ are the sets of segmentation labels and classification categories respectively. We would like to mention that $g$ may also represent other functions including texture or normal. Our aim is to design a general neural network that learns the mapping from input feature $f$ to the output $g$. For the segmentation and classification tasks, we precompute the normal and mesh Laplacian eigenvectors as input features. Our network will output an $N \times |C_s|$ matrix for segmentation, which gives the score for each vertex belonging to each segmentation label, or a $|C_l|$-dimensional softmax vector for classification.

3.2 Feature Precomputation

We aim to use minimal features to characterize local geometric information. For this purpose, we precompute the vertex normals and mesh Laplacian eigenvectors, which along with vertex positions are used as the input to our network. The mesh Laplacian provides an effective way to encode mesh connectivity.

For later pooling layers, we use Laplacian spectral clustering [20] at multiple resolutions. Different from spectral convolutions in [16], our pooling layers reduce any number of vertices to a desired dimension, which makes it possible for our method to cope with meshes with different topology. In practice, the Laplacian matrix is computed by

$$L = A^{-1}(D - W) \tag{1}$$

where $A = \mathrm{diag}(a_1, \ldots, a_N)$ are vertex weights defined as local Voronoi areas $a_i$, equal to one third of the sum of one-ring triangle areas. $W = \{w_{ij}\}_{i,j=1,\ldots,N}$ is the sparse cotangent weight matrix, a discretization of the continuous Laplacian over mesh surfaces [33], and $D$ is the degree matrix, a diagonal matrix with diagonal entries $d_{ii} = \sum_{j=1}^{N} w_{ij}$. The Laplacian feature $\Phi$ consists of the eigenvectors of the matrix $L$. Please refer to [20] for the detailed computation and applications of the graph Laplacian.
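As an illustration of Eq. (1), the following minimal NumPy sketch builds a dense Laplacian for a small triangle mesh and extracts its lowest-frequency eigenvectors. It is not the authors' code: the function names `mesh_laplacian` and `lowest_eigenvectors` are hypothetical, the weight convention $w_{ij} = (\cot\alpha_{ij} + \cot\beta_{ij})/2$ is an assumed (common) choice that the text does not spell out, and dense matrices are only practical for small meshes.

```python
import numpy as np

def mesh_laplacian(verts, faces):
    """Dense L = A^{-1}(D - W) of Eq. (1) plus per-vertex areas (small meshes only)."""
    n = len(verts)
    W = np.zeros((n, n))
    area = np.zeros(n)
    for tri in faces:
        for k in range(3):
            i, j, o = tri[k], tri[(k + 1) % 3], tri[(k + 2) % 3]
            # Cotangent of the angle at vertex o, which faces edge (i, j).
            u, v = verts[i] - verts[o], verts[j] - verts[o]
            cot = np.dot(u, v) / np.linalg.norm(np.cross(u, v))
            # Assumed convention: w_ij = (cot alpha + cot beta) / 2.
            W[i, j] += 0.5 * cot
            W[j, i] += 0.5 * cot
        # Each vertex receives one third of the triangle area.
        e1, e2 = verts[tri[1]] - verts[tri[0]], verts[tri[2]] - verts[tri[0]]
        area[tri] += np.linalg.norm(np.cross(e1, e2)) / 6.0
    D = np.diag(W.sum(axis=1))
    return (D - W) / area[:, None], area

def lowest_eigenvectors(L, area, k):
    """Eigenpairs of L for the k smallest eigenvalues.

    L is not symmetric, so we solve the equivalent symmetric problem
    A^{-1/2} (D - W) A^{-1/2} y = lambda y and map back with phi = A^{-1/2} y.
    """
    s = np.sqrt(area)
    S = (L * area[:, None]) / np.outer(s, s)
    vals, vecs = np.linalg.eigh(S)
    return vals[:k], vecs[:, :k] / s[:, None]
```

For meshes of realistic size, a sparse matrix and an iterative eigensolver would be the natural replacement for the dense arrays used here.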

To cluster vertices at different levels, we perform k-means clustering on $\Phi$ with different numbers of clusters $k = p_l$ such that vertices are clustered into $p_l$ clusters for the $l$th pooling block, where $l = 1, 2, \ldots, L$ and $L$ is the number of levels. To achieve local-global feature extraction, $p_l$ decreases as $l$ increases. Note that since the clustering is in the feature domain, vertices on the surface within one cluster are not necessarily connected. This is also reasonable because some semantically similar vertices can be far away.
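The multi-level clustering can be sketched as below. This is a hedged illustration rather than the authors' implementation: `kmeans` is a plain Lloyd iteration with a fixed random seed, `cluster_hierarchy` is a hypothetical helper name, the default levels (16, 8) mirror the default setting reported in the experiments, and cluster ids are 0-based here whereas the text indexes them from 1.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd k-means; returns one cluster id in [0, k) per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centre.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_hierarchy(phi, levels=(16, 8)):
    """One cluster mask per pooling block, from fine (p_1) to coarse (p_L)."""
    return [kmeans(phi, p) for p in levels]
```

Because the clustering runs on the rows of $\Phi$ rather than on vertex positions, the same routine applies to meshes with any number of vertices.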

3.3 Network Architecture

The architecture of our LaplacianNet is illustrated in Figure 1.

Given the preprocessed feature function $f$ defined on vertices $V$, our LaplacianNet takes the feature matrix as input, each row being a feature vector of a vertex. By reducing the input features to a matrix, we avoid the complex graph structure and make it tractable for neural networks.

Several mesh pooling blocks (MPBs) are then applied multiple times at various resolutions. After the last MPB, the application network outputs the target function $g$. Details about pooling blocks will be further discussed in the next section. The architecture of the application network for classification and segmentation is shown in Figure 2.

In our design, to circumvent the complex and irregular topology of mesh data, we seek a pipeline that can concisely describe the relationship among vertices. Instead of directly processing edges $E$, we simplify this problem by only processing vertices $V$ of mesh $G$. Nevertheless, edges are not ignored, but instead implicitly encoded into Laplacian eigenvectors and spectral clusters as described in the previous subsection. Since the mesh Laplacian is intrinsically induced from geodesic distances, our method is robust to remeshing and isometric transformations.

Moreover, since the total number of vertices in a mesh can vary significantly from model to model, an ideal network architecture should be able to deal with meshes with different numbers of vertices. Our solution is to design mesh pooling blocks, which turn meshes of arbitrary sizes into levels with the same number of clusters. By stacking together several blocks in a multi-resolution manner, our network can learn to extract useful features from the mesh. In addition, our network uses parameters more effectively with a shared-weight Multi-Layer Perceptron (MLP), which also helps avoid over-fitting by reducing the complexity of our network. Our method consequently achieves good results for shape classification and segmentation.

3.4 Mesh Pooling Blocks

A mesh pooling block is composed of three modules: the Multi-Layer Perceptron (MLP) layers [34], the pooling layers, and a Correlation Net. Figure 1 shows an illustration of a mesh pooling block. Each mesh pooling block obtains its input feature $F_l$ of size $N \times c_l$ from the previous layer. It also gets the cluster mask $M_l$ from the precomputation step, where an entry $m_{l,i} \in \{1, 2, \ldots, p_l\}$ in the mask $M_l \in \mathbb{Z}^N$ indicates which cluster node $i$ belongs to. The total number of clusters for the $l$th block is $p_l$.

As illustrated in Figure 1, the data flows through three branches. The upper path is a series of MLP layers, learning vertex features of increasing complexity; the middle path is the pooling layer followed by a correlation matrix multiplication, which fuses global and local information; the bottom path is a Correlation Net that computes the correlation matrix, learning the interaction among clusters.

The upper path of the MPB in Figure 1 is a set of MLP layers with shared-weight perceptrons connected to all the vertices. For a certain vertex $i$, an MLP layer multiplies its input $F_{l,i}$ by a weight matrix $W$, adds a bias $b$, and applies a $\mathrm{ReLU}(\cdot)$ activation function. The operation of this layer can be expressed as

$$\mathrm{MLP}(F_{l,i}) = \mathrm{ReLU}(W F_{l,i} + b) \tag{2}$$

For the pooling layer, the input includes features $F_l$ from the last block and a cluster mask $M_l$. Its result is defined as applying the operation to all the nodes belonging to the same cluster. Take max pooling as an example. The pooling result $P_{l,j}$ corresponding to the $j$th cluster of the $l$th block is:

$$P_{l,j} = \max_{i \,:\, m_{l,i} = j} F_{l,i} \tag{3}$$
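The cluster-wise max pooling of Eq. (3) can be written compactly with NumPy. The sketch below is illustrative only: `cluster_max_pool` is a hypothetical name, cluster ids are assumed to be 0-based, and a cluster with no vertices is left at negative infinity (a case the text does not discuss).

```python
import numpy as np

def cluster_max_pool(F, mask, p):
    """F: (N, c) vertex features; mask: (N,) cluster ids in [0, p). Returns (p, c)."""
    P = np.full((p, F.shape[1]), -np.inf)
    # Unbuffered element-wise maximum: row mask[i] of P absorbs row i of F.
    np.maximum.at(P, mask, F)
    return P
```

The output has $p_l$ rows regardless of $N$, which is what lets later layers operate on meshes with different vertex counts.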

Fig. 2. The application network for segmentation and classification. For the segmentation task (top row), the application-specific network takes as input the category label of a certain item and the features defined on vertices. The vertex features go through two MLP layers and are then duplicated into two branches, one of which is sent to a global pooling layer, combined with the one-hot vector of the input label, and then attached back to the vertex features from the other branch. Finally, two MLP layers are used to compute the final segmentation mask. For the classification network (bottom row), the vertex features sequentially go through MLP layers, global pooling, and fully connected layers. Eventually a softmax vector for candidate labels is predicted.

Furthermore, we want the features to be computed across clusters. For images, convolutional kernels can be used to fuse pooling results. However, since triangle meshes do not have a regular grid and consistent clustering, such an approach does not work. Standard global pooling can be a simple choice, but each cluster has equal contribution and detailed information is lost. At the bottom path in Figure 1, to aggregate information from all clusters, we multiply the pooling results with a correlation matrix $C_l = \{c^l_{ij}\}_{p_l \times p_l}$. Each entry $c^l_{ij}$ measures the correlation between the $i$th and $j$th clusters, such that the aggregated pooling result, denoted $\tilde{P}_l$ to distinguish it from $P_l$, is obtained as $\tilde{P}_l = C_l P_l$. The Correlation Net computes the matrix $C_l$ by learning a latent vector embedding for each cluster:

$$\Psi_l = \{\psi^l_j\} = \mathrm{Pooling}(\mathrm{MLP}(F_l)), \tag{4}$$

and entries of the correlation matrix are inner products of the latent vectors of cluster pairs, $c^l_{ij} = \langle \psi^l_i, \psi^l_j \rangle$, so that

$$\tilde{P}_l = \Psi_l \Psi_l^T P_l \tag{5}$$

The concatenation layer combines vertex features from the upper-path MLPs and aggregated pooling results. For the vertex $i$ in cluster $j$, its output feature is written as

$$\{\mathrm{MLP}(F_{l,i}), \tilde{P}_{l,j}\} \tag{6}$$
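Putting Eqs. (2) to (6) together, one forward pass of a mesh pooling block might look like the NumPy sketch below. It is a simplified illustration under stated assumptions, not the authors' TensorFlow implementation: a single MLP layer stands in for each MLP stack, the weights are passed in explicitly, features are stored as rows (so the code computes `F @ W` rather than `W F`), cluster ids are 0-based, and every cluster is assumed to be non-empty. All function and parameter names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(F, mask, p):
    """Cluster-wise max pooling, Eq. (3)."""
    P = np.full((p, F.shape[1]), -np.inf)
    np.maximum.at(P, mask, F)
    return P

def mesh_pooling_block(F, mask, p, W_mlp, b_mlp, W_corr, b_corr):
    """F: (N, c) features, mask: (N,) cluster ids. Returns (N, d_mlp + c)."""
    H = relu(F @ W_mlp + b_mlp)                          # Eq. (2), shared weights
    P = max_pool(F, mask, p)                             # Eq. (3), shape (p, c)
    Psi = max_pool(relu(F @ W_corr + b_corr), mask, p)   # Eq. (4), shape (p, d_corr)
    C = Psi @ Psi.T                                      # c_ij = <psi_i, psi_j>
    P_agg = C @ P                                        # Eq. (5)
    # Eq. (6): each vertex gets its own MLP feature plus its cluster's fused feature.
    return np.concatenate([H, P_agg[mask]], axis=1)
```

Since the correlation matrix is built from per-cluster embeddings, relabeling the clusters permutes the rows of `Psi` and `P` consistently and leaves the per-vertex output unchanged, which matches the stated goal of circumventing the arbitrary cluster ordering.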

In summary, the MPB is used for both processing the local information and finding the relationships among local patches. The motivation for pooling in the mesh is to better understand spatial information. Ideally, the input vertices are hierarchically clustered into different areas, and the relationships between areas need to be considered. In practice, we process the hierarchical information by clustering vertices into different numbers of clusters. The Correlation Net learns to compute a correlation matrix to describe the relative relationships among spatial areas after pooling. Using Laplacian clustering ensures that the clustering is robust under different triangulation. Also, the Euclidean distance in the Laplacian feature domain approximates the geodesic distance [9]. Therefore, such a clustering can find meaningful patches in the spatial domain. Moreover, the clustered areas give a reasonable partition, as shown in the camel example in Figure 1 where vertices with the same color roughly belong to one semantic area.

4 EXPERIMENTS

4.1 Implementation Details

We implement our method using TensorFlow. We train and test the network on a desktop with an NVIDIA GTX 1080 GPU. The optimization is achieved using the Adam [35] solver with learning rate $7 \times 10^{-4}$ and momentum 0.9. The network is trained for 200 epochs with batch size 8. The input features have 22 dimensions, including 6 dimensions of positions and normals, and the other 16 dimensions are the absolute values of the Laplacian eigenvectors corresponding to the 16 lowest frequencies. According to the experimental results in the next section, by default we choose to use two Mesh Pooling Blocks with 16 and 8 clusters respectively. There are two MLP layers in a mesh pooling block outputting 128 and 256 channels. Following the MPBs is a specific application network based on the task to be performed. In total, the network's depth is 11 for the segmentation and classification tasks.
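The 22-dimensional input described above can be assembled as in the following sketch. The helper name `build_input_features`, the column order (positions, normals, spectral part), and the assumption that the eigenvector columns are already sorted by ascending eigenvalue are illustrative choices, not details confirmed by the text.

```python
import numpy as np

def build_input_features(positions, normals, eigvecs, n_low=16):
    """Stack xyz (3), normals (3) and |low-frequency eigenvectors| (n_low) per vertex.

    eigvecs: (N, k) array whose columns are assumed sorted by ascending eigenvalue.
    """
    # The absolute value removes the sign ambiguity of each eigenvector.
    spectral = np.abs(eigvecs[:, :n_low])
    return np.concatenate([positions, normals, spectral], axis=1)
```

Flipping the sign of any eigenvector leaves the resulting feature matrix unchanged, which is the purpose of taking absolute values.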

4.2 Network Evaluation

We now evaluate different parts of our network. This series of evaluations is conducted on the COSEG dataset, with results shown in Tab. 1.

First we test the usefulness of the input features. The performance of a certain algorithm can be affected by the features used. In the experiments, we use coordinates, normals and Laplacian eigenvectors as vertex features. In this part, we test the network without one of those three features. We can see that a combination of all the features achieves the best results.

Second, we test the setting of Mesh Pooling Blocks. By default, our method uses 2 MPBs. We compare this with alternative numbers of MPBs ranging from 0 to 3. The results show that 2 MPBs (our default method) gives the best performance. We also test the usefulness of our Correlation Net. Ours-noCorrNet is the network without the Correlation Net and matrix C. We observe that the aggregation of global information is important for our network.


Third, another difficulty when handling mesh data is varying mesh connectivity. As mentioned before, our method is robust under different mesh triangulation as a result of the pooling strategy. The Laplacian eigenfunction is induced from geodesic distances and therefore invariant under isometric transformation, so the pooling areas as well as input features can essentially stay unchanged when we remesh the object. We visualize the connectivity of the objects before and after remeshing in Figure 3, which is obtained by subdivision followed by mesh decimation [36] to generate irregular connectivity. Moreover, the quantitative results (Ours-remesh) in Tab. 1 show that our network is robust under different triangulation.

Fourth, our network does not rely on the same vertex numbers for models. An experiment is performed to test this. In this experiment, we first simplify COSEG models to 1500, 2000 and 2500 vertices. Then we split the training and test sets using the same strategy for all three resolutions. We train our network on a mix of two of the datasets and test on models from the third (Ours-1500, Ours-2000 and Ours-2500 in the table). Accuracies in Tab. 1 show that our network works well with varying vertex numbers, compared to the last row where LaplacianNet is trained with models all containing 2048 vertices.

4.3 Part Segmentation

In this section, we use MPBs to conduct part segmentation on the ShapeNet [26], [37] and COSEG [27] datasets. The application network for segmentation is illustrated in Figure 2 (top row). Its input is the category label and features from the previous MPB. The output is a softmax score for each category. The application network has two perceptron layers, a maxpooling layer and two fully connected layers. We minimize the cross-entropy loss between the one-hot vector of ground truth and the network output.
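For reference, the cross-entropy between one-hot ground truth and softmax output reduces to the negative log-probability of the true class. The sketch below is a generic, numerically stable NumPy version and not the authors' TensorFlow code; `softmax_cross_entropy` is a hypothetical name, and averaging over vertices is an assumed reduction.

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """scores: (N, K) raw scores; labels: (N,) integer class ids. Returns the mean loss."""
    # Subtract the row maximum so the exponentials cannot overflow.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # With one-hot targets only the log-probability of the true class remains.
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```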

ShapeNet is a large repository of shapes from multiple categories. To leverage this dataset to perform segmentation, Yi et al. [26] develop an active learning method for efficient annotation. However, their annotations are not directly on the mesh vertices, but on the point cloud resampled from the shapes. To recover the graph structure of manifold surfaces for computation of the mesh Laplacian and segmentation, we apply [38] on the original ShapeNet models. After that, we transfer the annotations on the point clouds to the nearest mesh vertices.
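One plausible reading of the label transfer is that every mesh vertex takes the label of its nearest annotated point; the text does not specify the direction of the lookup, so this is an assumption. The brute-force sketch below (hypothetical name `transfer_labels`) illustrates that reading with NumPy.

```python
import numpy as np

def transfer_labels(points, point_labels, verts):
    """Give each mesh vertex the label of its nearest annotated point.

    points: (M, 3), point_labels: (M,), verts: (N, 3). Brute force, O(N * M) memory.
    """
    d2 = ((verts[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return point_labels[d2.argmin(axis=1)]
```

For large point clouds a spatial index such as a k-d tree would avoid the full distance matrix.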

Two metrics are used in previous segmentation results on ShapeNet, namely accuracy and IoU (Intersection-over-union). We compute both of them to compare with state-of-the-art deep learning methods on 3D shapes [12], [31], and some other methods performing segmentation on ShapeNet based on point clouds such as [39]. Tab. 2 shows that our method achieves the highest average accuracy, and outperforms state-of-the-art methods on 13 out of 16 categories. In terms of IoU, our performance is comparable to the state-of-the-art, achieving the best performance in 8 categories. Some segmentation results are presented in Figure 4.
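The two metrics can be computed per shape as in the sketch below. It is a generic illustration, not the evaluation script used for Tab. 2: the function names are hypothetical, and counting a part that is absent from both prediction and ground truth as IoU 1 is an assumed convention.

```python
import numpy as np

def accuracy(pred, gt):
    """Fraction of vertices whose predicted label matches the ground truth."""
    return float(np.mean(pred == gt))

def mean_part_iou(pred, gt, part_ids):
    """Average intersection-over-union over the given part labels of one shape."""
    ious = []
    for part in part_ids:
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))
        # Assumed convention: a part absent from both counts as a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, the accuracy is 0.75, while the part IoUs are 1/2 and 2/3.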

The COSEG [27] dataset is also a commonly used benchmark for shape segmentation. Compared to ShapeNet, COSEG is much smaller. It has 8 scanned categories and 3 synthetic categories. Each of the 8 scanned categories has around 20 models, which are too few for deep learning. The 3 synthetic categories each have 900 models, so we test our algorithm with the synthetic categories. We compare our result with [40] and [30] in Tab. 1. Our approach outperforms both of them on Vase and Tele-alien, and is second to [30] on Chair (94.2% vs. 95.9%).

TABLE 1
Segmentation accuracy on COSEG. We compare with [40] and [30] in the first two rows. In the last row, LaplacianNet is trained on models with 2048 vertices. To test the robustness to different vertex numbers, we simplify COSEG models to 1500, 2000 and 2500 vertices respectively. We train three networks on two of the three datasets but test on the third. Ours-1500, Ours-2000 and Ours-2500 show the accuracy when the test set has 1500, 2000 and 2500 vertices. We observe that our network performs similarly well with different vertex numbers. Stable performance is also obtained when applying our method to remeshed models with more irregular connectivity (Ours-remesh). The ablation test on the features shows that all three kinds of features contribute to the performance. We also vary the number of MPBs and find that using two MPBs performs best. In general, our method achieves state-of-the-art results in all three categories.

Method                Chair    Vase     Tele-alien
Xie et al. [40]       87.1%    85.9%    83.2%
Wang et al. [30]      95.9%    91.2%    93.0%
Ours-noMPB            76.6%    77.8%    80.7%
Ours-1MPB             85.3%    86.1%    88.9%
Ours-3MPBs            90.1%    92.2%    91.6%
Ours-noCorrNet        86.9%    85.6%    90.7%
Ours-1500             90.3%    90.0%    89.0%
Ours-2000             90.9%    91.6%    89.3%
Ours-2500             87.0%    86.6%    88.5%
Ours-remesh           92.1%    91.5%    91.8%
Ours-noCoordinates    90.6%    88.6%    84.2%
Ours-noNormal         87.6%    86.1%    85.0%
Ours-noLaplacian      79.6%    87.1%    86.1%
Ours                  94.2%    92.2%    93.9%

Fig. 3. Robustness to changing connectivity. In this experiment, we change the connectivity of meshes from the COSEG dataset, and test whether our network can consistently perform well. The first row shows the original objects, while the meshes in the second row are remeshed into more irregular triangulation. As shown in Table 1, the segmentation accuracy still remains high.

4.4 Mesh Classification

In this section, we evaluate the accuracy of object categorization. The input features for this task are the same as for the segmentation task. The application-specific network following the Mesh Pooling Blocks is shown in Figure 2 (bottom row). It contains two perceptron layers, a max-pooling layer and two fully connected layers. In this case the output is a softmax score for each category. We minimize the cross-


TABLE 2
Accuracy and IoU of different methods on ShapeNet. For the task of 3D shape segmentation, we compare our method with Shapeboost [17], Guo [32], and ShapePFCN [31] in terms of accuracy, and with FeaStNet [39], ACNN [8], Yi [12], and VoxelCNN [12] in terms of IoU (intersection-over-union). Our LaplacianNet achieves the highest mean accuracy.

| Method | mean | plane | bag | cap | car | chair | earphone | guitar | knife | lamp | laptop | motorbike | mug | pistol | rocket | skateboard | table |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Shapeboost (acc) | 77.2 | 85.8 | 93.1 | 85.9 | 79.5 | 70.1 | 81.4 | 89.0 | 81.2 | 71.1 | 86.1 | 77.2 | 94.9 | 88.2 | 79.2 | 91.0 | 74.5 |
| Guo (acc) | 77.6 | 87.4 | 91.0 | 85.7 | 80.1 | 66.8 | 79.8 | 89.9 | 77.1 | 71.6 | 82.7 | 80.1 | 95.1 | 84.1 | 76.9 | 89.6 | 77.8 |
| ShapePFCN (acc) | 85.7 | 90.3 | 94.6 | 94.5 | 90.2 | 82.9 | 84.9 | 91.8 | 82.8 | 78.0 | 95.3 | 87.0 | 96.0 | 91.5 | 81.6 | 91.9 | 84.8 |
| ours (acc) | 91.5 | 89.6 | 90.2 | 88.2 | 88.2 | 83.2 | 82.3 | 95.6 | 88.7 | 87.4 | 96.3 | 70.6 | 97.0 | 92.7 | 82.2 | 94.7 | 92.6 |
| FeaStNet (IoU) | 81.5 | 79.3 | 74.2 | 69.9 | 71.7 | 87.5 | 64.2 | 90.0 | 80.1 | 78.7 | 94.7 | 62.4 | 91.8 | 78.3 | 48.1 | 71.6 | 79.6 |
| ACNN (IoU) | 79.6 | 76.4 | 72.9 | 70.8 | 72.7 | 86.1 | 71.1 | 87.8 | 82.0 | 77.4 | 95.5 | 45.7 | 89.5 | 77.4 | 49.2 | 82.1 | 76.7 |
| VoxelCNN (IoU) | 79.4 | 75.1 | 72.8 | 73.3 | 70.0 | 87.2 | 63.5 | 88.4 | 79.6 | 74.4 | 93.5 | 58.7 | 91.8 | 76.4 | 51.2 | 65.3 | 77.1 |
| Yi (IoU) | 84.7 | 81.6 | 81.7 | 81.9 | 75.2 | 90.2 | 74.9 | 93.0 | 86.1 | 84.7 | 95.6 | 66.7 | 92.7 | 81.6 | 62.1 | 82.9 | 82.1 |
| ours (IoU) | 84.3 | 82.9 | 83.4 | 81.7 | 80.0 | 75.4 | 71.8 | 91.9 | 81.0 | 80.9 | 92.5 | 59.2 | 93.5 | 86.3 | 74.3 | 90.3 | 86.4 |
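The two metrics reported in Table 2 can behave differently on the same prediction, so a small sketch of both may help. It assumes per-vertex part labels and an unweighted average of per-part IoU over the parts of a shape; the exact averaging used for the table is not stated here, so the helper names and the convention for absent parts are assumptions.

```python
def label_accuracy(pred, truth):
    """Fraction of vertices whose predicted part label matches the ground truth."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def mean_part_iou(pred, truth, part_labels):
    """Average intersection-over-union over the given part labels.

    Assumption: a part absent from both prediction and ground truth
    counts as IoU = 1.
    """
    ious = []
    for part in part_labels:
        inter = sum(p == part and t == part for p, t in zip(pred, truth))
        union = sum(p == part or t == part for p, t in zip(pred, truth))
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)


# Toy shape with six vertices and three parts; one vertex is mislabeled.
truth = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 2]
acc = label_accuracy(pred, truth)
iou = mean_part_iou(pred, truth, part_labels=[0, 1, 2])
print(f"accuracy = {acc:.3f}, mean IoU = {iou:.3f}")
```

A single wrong vertex lowers the IoU of two parts at once (the part it left and the part it joined), which is one reason IoU values tend to be lower than accuracy values for the same prediction.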

Fig. 4. Qualitative results on ShapeNet. The segmentation results produced by our method are plausible.

TABLE 3
Classification accuracy on ModelNet10, ModelNet40 and ShapeNet. Distinguished by input representation, SPH [41], SyncSpecCNN [12], ACNN [8] and FoldingNet [8] use meshes; PointNet [14], PointNet++ [15] and SO-Net [42] use point clouds; Voxnet [43] and 3DShapeNets [5] take voxels as input. ‘-’ in the table indicates that the performance is not reported. For the classification accuracy on ShapeNet, our LaplacianNet achieves state-of-the-art performance. LaplacianNet outperforms all of these single-model classification methods.

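| Method | input | MN10 | MN40 | ShapeNet |
|---|---|---|---|---|
| PointNet | point | - | 89.2% | - |
| PointNet++ | point | - | 91.9% | - |
| SO-Net | point | 95.7% | 93.4% | - |
| 3DShapeNets | volume | 83.5% | 77.0% | - |
| Voxnet | volume | 91.0% | 84.5% | - |
| ACNN | mesh | - | - | 93.99% |
| SyncSpecCNN | mesh | - | - | 99.71% |
| SPH | mesh | - | 68.2% | - |
| Ours | mesh | 97.4% | 94.21% | 99.88% |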

entropy loss between the one-hot ground-truth vector and the network output.
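The loss above can be illustrated with a minimal sketch in plain Python. It covers only the last step (softmax over category scores followed by cross-entropy against a one-hot target), not the perceptron, pooling or fully connected layers; the logits are hypothetical values, not outputs of the network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross-entropy between a one-hot target and predicted probabilities.

    With a one-hot target y, -sum_k y_k * log(p_k) reduces to -log(p_true).
    """
    return -math.log(probs[true_class])


logits = [2.0, 0.5, -1.0]  # hypothetical scores for three categories
probs = softmax(logits)
loss = cross_entropy(probs, true_class=0)
print(f"probabilities = {[round(p, 3) for p in probs]}, loss = {loss:.3f}")
```

Because the target is one-hot, only the probability assigned to the correct category enters the loss, so minimizing it pushes that probability toward 1.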

For this classification task, we use ModelNet10, ModelNet40 [5] and ShapeNet. In each experiment, the network is trained for 200 epochs. Our method outperforms all state-of-the-art single-model shape classification approaches on all the datasets.

4.5 Failure Cases

We would like to restate that our method does not work directly on the original ShapeNet models for two reasons: 1) the annotations are labeled on point clouds uniformly sampled from shapes, instead of vertices of the mesh; 2) meshes in ShapeNet are not manifold meshes, preventing us from performing high-quality spectral clustering. Therefore

Fig. 5. Failure cases in ShapeNet segmentation. We show poor segmentation results in four categories in ShapeNet. (a) is the ground truth and (b) is our prediction. These imperfect results are not necessarily caused by our segmentation method alone, but also by imperfect labeling in the dataset and error accumulation during preprocessing. The following examples show the main challenges. The ground truth segmentation mask of the chair is not smooth, which shows the limitation of the annotation. Some regions of the segmentation annotation on the bag are incorrect. This kind of error can be introduced when converting the mesh into a watertight manifold. Our method also suffers from a lack of training examples and fails to produce correct results on the motorbike. The skateboard has a hole in the mesh, which leads to poor segmentation around it.

we convert shapes into watertight manifold surfaces using [38], simplify the mesh to a reasonable level, and transfer part labels to vertices from the point cloud through nearest neighbor matching. Error accumulates during this process. In Figure 5 we show some failure cases when performing part segmentation in ShapeNet. There are several typical problems for models that are poorly segmented. In the chair example, we can observe sharp edges in the ground truth labels, and such artifacts could be caused by noise in the original point cloud. This kind of noise makes learning a reasonable segmentation more challenging. In the bag case, we can observe in the red circle that the ground truth is incorrect. This may be caused by changes of the shape while using [38] to make the model watertight, so that the nearest neighbor algorithm gets the wrong correspondence when transferring labels. We can see that our algorithm actually gets a more correct answer than the “ground truth”. Failure in the motorbike case is caused by a lack of training data. In the annotations, the motorbike class has the fewest data samples. Worse still, some of the training examples fail to be converted by [38], because motorbikes have


complex topological structure, making transfer challenging. In practice, it can be a big problem that errors in the transfer and simplification decrease the amount of training data. Last but not least, the skateboard shows that there might be some holes, or other artifacts, in the surface that could affect the graph structure and mislead our algorithm.
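The label transfer step discussed in this section can be sketched as a brute-force nearest neighbor search. This is a minimal illustration under our own assumptions (Euclidean distance, one label per annotated point, the hypothetical helper name `transfer_labels`), not the preprocessing code used for the experiments.

```python
def transfer_labels(vertices, points, point_labels):
    """Give each mesh vertex the label of its nearest annotated point.

    Brute-force O(len(vertices) * len(points)) search; a spatial index such
    as a k-d tree would be the usual choice for large point clouds.
    """
    if len(points) != len(point_labels) or not points:
        raise ValueError("need one label per point and at least one point")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for v in vertices:
        nearest = min(range(len(points)), key=lambda i: sq_dist(v, points[i]))
        labels.append(point_labels[nearest])
    return labels


# Toy example: three annotated points and three mesh vertices near them.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
point_labels = ["seat", "leg", "back"]
vertices = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.8, 0.1)]
vertex_labels = transfer_labels(vertices, points, point_labels)
print(vertex_labels)
```

The sketch also makes the failure mode visible: if the watertight conversion moves a vertex closer to points of a neighboring part, the vertex silently inherits the wrong label, as in the bag example.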

5 CONCLUSION

In this paper, we present a deep learning approach to predicting functions defined on shapes. The key idea is to perform multiscale pooling based on Laplacian spectral clustering, and use a Correlation Net following pooling to fuse global information. Compared to the pooling operation in the previous literature, our network does not require a uniform number of vertices in each model. Our work outperforms state-of-the-art methods in most categories for shape classification and segmentation.
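The two ideas summarized above can be sketched in a few lines of plain Python: pooling per-vertex features over clusters, which yields a fixed-size output for any vertex count, and mixing the cluster features with a correlation matrix. The cluster assignments are given by hand here rather than by spectral clustering, and the correlation matrix is assumed to be a row-wise softmax of feature dot products, whereas the Correlation Net computes it with a network.

```python
import math


def max_pool_clusters(features, cluster_of, num_clusters):
    """Max-pool per-vertex features into one feature vector per cluster.

    features[i] is the feature vector of vertex i and cluster_of[i] its
    cluster index. Assumes every cluster contains at least one vertex.
    """
    dim = len(features[0])
    pooled = [[-math.inf] * dim for _ in range(num_clusters)]
    for feat, c in zip(features, cluster_of):
        pooled[c] = [max(a, b) for a, b in zip(pooled[c], feat)]
    return pooled


def correlate(pooled):
    """Mix cluster features globally: out = C @ pooled.

    Assumption: C is a row-wise softmax of dot products between cluster
    features, used here as a stand-in for a learned correlation matrix.
    """
    out = []
    for q in pooled:
        scores = [sum(a * b for a, b in zip(q, k)) for k in pooled]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi / z * k[d] for wi, k in zip(w, pooled))
                    for d in range(len(q))])
    return out


# Two meshes with 3 and 5 vertices, both pooled into 2 clusters.
mesh_a = max_pool_clusters([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]], [0, 0, 1], 2)
mesh_b = max_pool_clusters(
    [[1.0, 1.0], [3.0, 0.0], [0.0, 0.0], [2.0, 2.0], [0.0, 4.0]],
    [0, 1, 1, 0, 1], 2)
mixed_a = correlate(mesh_a)
print(mesh_a, mesh_b, mixed_a)
```

Both meshes end up with a 2 x 2 matrix of cluster features despite their different vertex counts, which is the property that removes the need for a uniform number of vertices.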

Several future extensions can be explored. First, our LaplacianNet may be applied to other tasks. For example, 3D reconstruction is fundamental but challenging. A capable and general method to generate meshes is in great demand. Our network has the potential to achieve this task, because it can neatly encode connectivity in vertices, and intrinsically understands the topology through spectral clustering and spatial pooling. Furthermore, as a general structure for mesh processing, our network may also be applied to shape deformation, completion and correspondence.

Another area of future work is to design a kernel with trainable parameters, like convolutional kernels in images. For instance, a Gaussian kernel can be calculated inside a pooling cluster. Meanwhile, the transpose convolution plays an important role in traditional CNN frameworks. The combination of forward and transpose convolutions lets the network freely down-sample and up-sample data.

Finally, it would be interesting to see how our network can work on general graphs. In this paper, we mainly deal with manifold meshes, using geodesic distances to construct the Laplacian. However, in other problem settings, we can use any distance metric that best describes the problem. For example, we might experiment with our method on social networks with arbitrary size and connectivity.
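To make the last point concrete, the sketch below builds a graph Laplacian from an arbitrary pairwise distance matrix. It assumes a Gaussian affinity and the unnormalized form L = D - W; both choices, as well as the bandwidth `sigma`, are illustrative assumptions rather than the construction used in our experiments.

```python
import math


def graph_laplacian(dist, sigma=1.0):
    """Unnormalized Laplacian L = D - W from a symmetric distance matrix.

    Assumption: affinities are W_ij = exp(-d_ij^2 / (2 * sigma^2)) for
    i != j and W_ii = 0; D is the diagonal matrix of row sums of W.
    """
    n = len(dist)
    W = [[0.0 if i == j else math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return L


# Hop distances on a three-node path graph, standing in for any metric.
dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
L = graph_laplacian(dist)
for row in L:
    print([round(x, 3) for x in row])
```

Any metric that yields a symmetric distance matrix can be plugged in, and the resulting Laplacian keeps the properties that spectral clustering relies on: it is symmetric and each row sums to zero.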

ACKNOWLEDGMENTS

This work was supported by National Natural Science Foundation of China (No. 61872440 and No. 61828204), Beijing Natural Science Foundation (No. L182016), Young Elite Scientists Sponsorship Program by CAST (No. 2017QNRC001), Youth Innovation Promotion Association CAS, Huawei HIRP Open Fund (No. HO2018085141), CCF-Tencent Open Fund, SenseTime Research Fund and Open Project Program of the National Laboratory of Pattern Recognition (NLPR).

REFERENCES

[1] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.

[2] M. Chen, G. Ding, S. Zhao, H. Chen, Q. Liu, and J. Han, “Reference based LSTM for image captioning.” in AAAI, 2017, pp. 3981–3987.

[3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.

[4] S. Kwak, S. Hong, B. Han et al., “Weakly supervised semantic segmentation using superpixel pooling network.” in AAAI, 2017, pp. 4111–4117.

[5] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1912–1920.

[6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond Euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.

[7] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst, “Geodesic convolutional neural networks on Riemannian manifolds,” in Proceedings of the IEEE international conference on computer vision workshops, 2015, pp. 37–45.

[8] D. Boscaini, J. Masci, E. Rodola, and M. Bronstein, “Learning shape correspondence with anisotropic convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 3189–3197.

[9] K. Crane, C. Weischedel, and M. Wardetzky, “The heat method for distance computation,” Commun. ACM, vol. 60, no. 11, pp. 90–99, Oct. 2017. [Online]. Available: http://doi.acm.org/10.1145/3131280

[10] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.

[11] R. Li, S. Wang, F. Zhu, and J. Huang, “Adaptive graph convolutional neural networks,” arXiv preprint arXiv:1801.03226, 2018.

[12] L. Yi, H. Su, X. Guo, and L. J. Guibas, “SyncSpecCNN: Synchronized spectral CNN for 3d shape segmentation.” in CVPR, 2017, pp. 6584–6592.

[13] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3d data,” in CVPR, 2016, pp. 5648–5656.

[14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3d classification and segmentation,” Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, vol. 1, no. 2, p. 4, 2017.

[15] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems, 2017, pp. 5099–5108.

[16] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.

[17] E. Kalogerakis, A. Hertzmann, and K. Singh, “Learning 3d mesh segmentation and labeling,” ACM Transactions on Graphics (TOG), vol. 29, no. 4, p. 102, 2010.

[18] J. Sun, M. Ovsjanikov, and L. Guibas, “A concise and provably informative multi-scale signature based on heat diffusion,” in Computer graphics forum, vol. 28, no. 5. Wiley Online Library, 2009, pp. 1383–1392.

[19] M. Aubry, U. Schlickewei, and D. Cremers, “The wave kernel signature: A quantum mechanical approach to shape analysis,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 1626–1633.

[20] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.

[21] Z. Wu, Y. Wang, R. Shou, B. Chen, and X. Liu, “Unsupervised co-segmentation of 3d shapes via affinity aggregation spectral clustering,” Computers & Graphics, vol. 37, no. 6, pp. 628–637, 2013.

[22] Z. Shu, C. Qi, S. Xin, C. Hu, L. Wang, Y. Zhang, and L. Liu, “Unsupervised 3d shape segmentation and co-segmentation via deep learning,” Computer Aided Geometric Design, vol. 43, pp. 39–52, 2016.

[23] Q. Huang, V. Koltun, and L. Guibas, “Joint shape segmentation with linear programming,” ACM Transactions on Graphics (TOG), vol. 30, no. 6, p. 125, 2011.

[24] O. Sidi, O. van Kaick, Y. Kleiman, H. Zhang, and D. Cohen-Or, Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. ACM, 2011, vol. 30, no. 6.


[25] X. Chen, A. Golovinskiy, and T. Funkhouser, “A benchmark for 3d mesh segmentation,” ACM Transactions on Graphics (TOG), vol. 28, no. 3, p. 73, 2009.

[26] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas et al., “A scalable active framework for region annotation in 3d shape collections,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 210, 2016.

[27] Y. Wang, S. Asafi, O. Van Kaick, H. Zhang, D. Cohen-Or, and B. Chen, “Active co-analysis of a set of shapes,” ACM Transactions on Graphics (TOG), vol. 31, no. 6, p. 165, 2012.

[28] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, “The Princeton shape benchmark,” in Shape modeling applications, 2004. Proceedings. IEEE, 2004, pp. 167–178.

[29] D. George, X. Xie, and G. K. Tam, “3d mesh segmentation via multi-branch 1d convolutional neural networks,” Graphical Models, vol. 96, pp. 1–10, 2018.

[30] P. Wang, Y. Gan, P. Shui, F. Yu, Y. Zhang, S. Chen, and Z. Sun, “3d shape segmentation via shape fully convolutional networks,” Computers & Graphics, vol. 70, pp. 128–139, 2018.

[31] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri, “3d shape segmentation with projective convolutional networks,” in Proc. CVPR, vol. 1, no. 2, 2017, p. 8.

[32] K. Guo, D. Zou, and X. Chen, “3d mesh labeling via deep convolutional neural networks,” ACM Transactions on Graphics (TOG), vol. 35, no. 1, p. 3, 2015.

[33] M. Meyer, M. Desbrun, P. Schroder, and A. H. Barr, “Discrete differential-geometry operators for triangulated 2-manifolds,” in Visualization and Mathematics III, 2003, pp. 35–57.

[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.

[35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[36] M. Garland and P. S. Heckbert, “Surface simplification using quadric error metrics,” in Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’97. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1997, pp. 209–216. [Online]. Available: https://doi.org/10.1145/258734.258849

[37] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “ShapeNet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.

[38] J. Huang, H. Su, and L. Guibas, “Robust watertight manifold surface generation method for ShapeNet models,” arXiv preprint arXiv:1802.01698, 2018.

[39] N. Verma, E. Boyer, and J. Verbeek, “FeaStNet: Feature-steered graph convolutions for 3d shape analysis,” in CVPR 2018-IEEE Conference on Computer Vision & Pattern Recognition, 2018.

[40] Z. Xie, K. Xu, L. Liu, and Y. Xiong, “3d shape segmentation and labeling via extreme learning machine,” in Computer graphics forum, vol. 33, no. 5. Wiley Online Library, 2014, pp. 85–95.

[41] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz, “Rotation invariant spherical harmonic representation of 3d shape descriptors,” in Symposium on geometry processing, vol. 6, 2003, pp. 156–164.

[42] J. Li, B. M. Chen, and G. H. Lee, “SO-Net: Self-organizing network for point cloud analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9397–9406.

[43] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 922–928.
