
3D Saliency Based on Supervoxels Rarity in Point Clouds

Julien Leroy, Nicolas Riche, Matei Mancas and Bernard Gosselin
TCTS Lab, Faculty of Engineering, University of Mons (UMONS)

{firstname.surname}@umons.ac.be

Abstract— Visual saliency is a computational process that seeks to identify the most attention-drawing regions from a visual point of view. Most saliency methods are based on characteristics such as color and texture, and more recently have tried to introduce depth information, which is known to be an important cue in the human cognitive system. We present a new full-3D mechanism of computational attention that extracts salient information directly from a 3D point cloud, which can be either a single RGBD view or the fusion of multiple viewpoints. The proposed method reduces the point cloud complexity through over-segmentation so as not to process all voxels, but only supervoxels, which are larger color- and geometrically-consistent 3D structures. A very interesting feature of our algorithm is that it provides a bottom-up saliency map which is viewpoint independent. This bottom-up saliency map can then be specialized for a given viewpoint by using top-down information, namely a centered Gaussian and the viewpoint-dependent depth information. Our approach was tested on a complex database of 80 color and depth image pairs with associated pixel-level ground truth segmentation masks. The use of 3D point clouds improves the results compared to depth-map-extended models, even if only the color feature is used.

I. INTRODUCTION

Visual saliency is an important part of the human attention mechanism. Modeling attention computationally is a widely studied topic, as it affects many areas such as object detection, segmentation and scene analysis, image compression, robotics, etc. The aim of saliency algorithms is to automatically predict human attention and to select a subset of all the perceived stimuli. Bottom-up attention uses signal-based features to find the most salient objects. Top-down attention uses a priori knowledge about the scene or task-oriented knowledge to modify the bottom-up saliency. In this paper, the saliency algorithm is based on the bottom-up approach, but the benefits of top-down information are also discussed. The literature has been prolific in the field of visual attention on 2D images. Recently, some models which integrate depth information have emerged, such as those proposed by [1], [2], [3], [4]. These attempts to bring depth into saliency models are often extensions of existing 2D models and are not specifically built to process 3D data. Some papers address the saliency of meshes, as in [5], [6], or saliency on point clouds, as in [7], [8], but they exploit only the geometric information of the 3D data. As more and more data from RGBD sensors becomes available (such as the Microsoft Kinect sensor), it becomes possible to work on both geometric and color information. Our goal is to design a new saliency model capable of processing large clouds of colored points while matching the performance of state-of-the-art salient object detection models. The proposed model fully exploits the 3D aspect of the data, which can consist of large organized (or not) point clouds. To do so, we first decompose the input cloud into supervoxels. This approach dramatically reduces the amount of information to process while minimizing the loss of input information. A multi-level supervoxel decomposition lets us capture different sizes of objects in the scene. Secondly, we use the concept of rarity to extract salient data [9]. Indeed, rareness is an efficient way to extract salient data and it does not depend on any spatial structure. The use of supervoxels and the rarity-based approach make our method able to operate on multiple data sources, organized or not, which opens the possibility to work on data merged from several sensors. In the next section, we provide a state of the art of the latest object-oriented saliency models, which will be used to validate our model. In Section III we describe our algorithm, which is validated in Section IV. Finally we discuss our findings and conclude in the last section.

II. RELATED WORK

Models of visual attention can be split into two main categories based on their purpose. The first category of models aims to predict the human eye gaze distribution. The second category focuses on finding interesting objects. Our model fits in the second category and intends to segment complex scenes into an object hierarchy based on the objects of interest. In this section, we briefly present recent state-of-the-art models providing object-based saliency maps, mainly based on 2D image saliency and extended to integrate depth information. Some of them are extended to use depth feature maps (called 2.5D models further in this paper). Those models are also the ones used to assess our method in Section IV. Indeed, while some models process 3D data, they are only based on geometrical features and come without an available validation, which makes a comparison with our model impossible.

In [10], the authors extend the visual attention model by integrating depth into the computational model built around conspicuity and saliency maps. This model is an extension with depth of the center-surround 2D saliency proposed by [11].

In [1], the method constructs 3D layout and shape features from depth measurements, which are integrated with image-based saliency. This method is an extension with depth of the 2D saliency model proposed by [13].


Fig. 1. Our method is divided into 3 major steps: (1) multiscale supervoxel decomposition, (2) color rarity applied on multiple color spaces, (3) inter-level and inter-feature fusion. A top-down centered Gaussian can be used to simulate the human centric preference [12].

III. SUPERRARE3D: SUPERVOXEL 3D SALIENCY

We propose a novel object-oriented algorithm of bottom-up attention dedicated to the analysis of colored point clouds. This model builds on the one proposed in [14]. One contribution is the use of a rarity-based approach based not on superpixels as in [14] but on supervoxels. Supervoxels are the result of an over-segmentation of a point cloud into regions of comparable sizes and characteristics (in terms of color and other 3D features). More details on supervoxels and the method used here are provided in the next sections. Our approach has four major advantages:

1) Supervoxels let us reduce the amount of processing and allow our method to work on organized or unorganized clouds. Thus, it can analyze point clouds or even fused point clouds coming from various sensors.
2) Supervoxels allow us to have an object-oriented approach in the 3D space.
3) The multi-level supervoxel decomposition allows us to maintain detection performance regardless of the size of the salient objects present in the data.
4) The approach provides a bottom-up 3D saliency map which is viewer-independent. It is possible to add viewer-dependent top-down information such as a viewer-dependent centered Gaussian and depth information. In this paper we only use the centered Gaussian, which all the other models also use, to keep the comparison fair.

Our method only uses one feature of the point cloud: the color. Other features like supervoxel orientation or other specific 3D features will be taken into account in future work. As the color feature is the only one we use, this approach is subject to the influence of the choice of the color representation. To provide a first solution to this influence, we propose to fuse the saliency maps computed on several color spaces. Our algorithm can be divided into three major stages: (1) supervoxel decomposition, (2) supervoxel rarity-based saliency mechanism, (3) fusion. We present the three main steps of our algorithm in the following sub-sections.

A. Supervoxel Cloud Segmentation

Superpixels are the result of the over-segmentation of an image into regions of pixels having similar perceptual properties. This is a step commonly used in computer vision as a preprocessing stage to reduce the amount of information to be processed while minimizing the loss of information.

We build our system on the same idea by using supervoxels instead of processing at the point level. We use the Voxel Cloud Connectivity Segmentation (VCCS) method [15] that extracts supervoxels from an organized or unorganized point cloud. The supervoxels replace the voxel-based structure of the original point cloud by a set of atomic regions that capture the local redundancy of information. They provide a convenient way to summarize the point cloud and thus greatly reduce the complexity of the subsequent analysis. However, if there is a major difference between the size of the supervoxels and the size of the salient object to be detected, the object can be merged with a nearby supervoxel and its information is lost, as in Figure 2. To avoid this situation, the rarity mechanism is applied to different levels of supervoxel decomposition so that at some level of detail the salient object is well captured. At this point the pathway of the algorithm is split between the different levels of supervoxel decomposition. This separation is made to capture all the information of salient objects by adjusting the size of the supervoxels. Indeed, as shown in Figure 2, if a supervoxel is too large, it may not stick properly to an object, which is then seen disappearing into an adjacent supervoxel. The supervoxel size is an essential and sensitive parameter for the performance of the algorithm. For the method to be effective, the size must be set so as to decompose an object into a small subset of supervoxels. Empirically, we observed that the method works well when 3 or 4 supervoxels represent a salient element. To this end, the algorithm works on several levels in parallel that are merged into a final saliency map, in order to maintain both the information of large objects and of smaller ones, while refining the segmentation of salient regions.
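To make the multi-level decomposition concrete, the following Python/NumPy sketch illustrates the idea. The crude_supervoxels helper is a hypothetical, purely geometric stand-in for the VCCS algorithm of [15] (the real algorithm additionally enforces color, normal and adjacency constraints); only the multi-level loop reflects the strategy described above.

    import numpy as np

    def crude_supervoxels(points_xyz, colors_rgb, seed_size):
        # Crude stand-in for VCCS [15]: group points by a regular grid of cell
        # size seed_size (metres) and average their colors per cell.
        keys = np.floor(points_xyz / seed_size).astype(np.int64)
        _, labels = np.unique(keys, axis=0, return_inverse=True)
        n_sv = labels.max() + 1
        counts = np.bincount(labels, minlength=n_sv).astype(float)
        mean_colors = np.stack(
            [np.bincount(labels, weights=colors_rgb[:, c], minlength=n_sv) / counts
             for c in range(3)], axis=1)
        return labels, mean_colors

    def multilevel_decomposition(points_xyz, colors_rgb, seed_sizes=(0.05, 0.02)):
        # Keep several decompositions in parallel so that, at some level of
        # detail, each salient object is covered by only a few (3-4) supervoxels.
        return [crude_supervoxels(points_xyz, colors_rgb, s) for s in seed_sizes]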

Fig. 2. (a) Example of a table point cloud with a box and cups, (b) supervoxel segmentation using VCCS with a seed size of 0.5 m, (c) supervoxel segmentation using VCCS with a seed size of 0.25 m. The size of the supervoxels is essential to extract the information of all the objects. If the seed is too large, as in (b), the object is absorbed into an adjacent supervoxel, losing the information for the rarity mechanism.

Fig. 3. Examples of results obtained with our method on 2 different point clouds: (left) a Kinect cloud (organized, 307200 points, three levels of decomposition with 92, 32 and 13 supervoxels); (right) a point cloud recorded using a Riegl VZ-400 and a co-calibrated 10-megapixel Canon 1000D camera [16] (unorganized, 5976977 points, 1 level of decomposition with 271 supervoxels).

B. Rarity-Based Saliency

For each supervoxel feature vector, the rarity mechanism computes the cross-scale occurrence probability of each of the $N$ supervoxels. For each color component $i$, a rarity value is obtained as the self-information of the occurrence probability of the supervoxel, as shown in Eq. (1). $P_i$ is the occurrence probability of the value of supervoxel $Sv_i$ with respect to the empirical probability distribution represented by the histogram within the $i$-th color channel.

$$\mathrm{Rarity}(Sv_i) = -\log\left(\frac{P_i}{N}\right) \qquad (1)$$

Then, the self-information is used to represent the attention score for the supervoxel region. This mechanism provides higher scores for rare regions. The rarity value falls between 0 (all the supervoxels are the same) and 1 (one supervoxel different from all the others).
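In practice, the rarity of Eq. (1) reduces to a histogram lookup followed by a self-information computation. The following NumPy sketch assumes each supervoxel is represented by its mean value in one color channel; the number of bins and the final normalization to [0, 1] are illustrative choices, not values taken from the paper, and $P_i/N$ is read here as the fraction of supervoxels falling in the same histogram bin.

    import numpy as np

    def channel_rarity(channel_values, n_bins=16):
        # channel_values: mean value of one color channel for each of the N supervoxels.
        n = len(channel_values)
        hist, edges = np.histogram(channel_values, bins=n_bins)
        bin_idx = np.digitize(channel_values, edges[1:-1])  # bin of each supervoxel
        p = hist[bin_idx] / float(n)                        # occurrence probability
        rarity = -np.log(p)                                 # self-information (Eq. 1)
        # Normalize so identical supervoxels score 0 and a unique one tends to 1.
        return rarity / rarity.max() if rarity.max() > 0 else rarity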

1) Intra- and inter-supervoxel-level fusion: The rarity maps obtained from the rarity mechanism on each color channel (in this case, we select 6 color space representations: HSV, HLS, YUV, RGB, Lab, Luv) are first combined within each color space. In our experiments, we empirically select a two-level decomposition using supervoxel seeds of 0.05 m and 0.02 m as a balance between accuracy and computation time. A fusion between rarity maps of the same color space is achieved at each decomposition level by using the fusion method proposed by Itti et al. [17]. The idea is to give a higher weight to the map which has important peaks compared to its mean (Eq. 2).

$$S = \sum_{i=1}^{N} EC_i \cdot map_i \qquad (2)$$

where $EC_i$ is the efficiency coefficient for each channel and is computed as in Eq. (3):

$$EC_i = (\max_i - \mathrm{mean}_i)^2 \qquad (3)$$

These coefficients let us sort the different maps ($map_i$) based on each map's efficiency coefficient $EC_i$. Each map is then multiplied by a fixed weight defined by its rank $i = 1 \ldots K$ among the sorted maps, where $K$ is the number of maps to mix (here $K = 3$), as shown in Eq. (4). $T$ is an empirical threshold defined in [9].

$$\forall i \in [1,K]:\quad
\begin{cases}
saliency_i = 0 & \text{if } \frac{EC_i}{EC_K} < T \\
saliency_i = \frac{i}{K} \cdot map_i & \text{if } \frac{EC_i}{EC_K} > T
\end{cases}
\qquad (4)$$

At the end of this first fusion process, the model provides one saliency map per color space representation. The second fusion step, an inter-color fusion between the maps coming from the different color space representations, is achieved using the same method as the one explained for the inter-decomposition-level fusion (Eq. 2).
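The fusion of Eqs. (2)-(4) can be sketched as follows, assuming the rarity maps are per-supervoxel NumPy arrays of identical shape. The threshold value used here is only a placeholder; the actual empirical threshold $T$ is the one defined in [9].

    import numpy as np

    def efficiency(m):
        # Efficiency coefficient EC of a map (Eq. 3): maps with a strong peak
        # relative to their mean are considered more reliable.
        return (m.max() - m.mean()) ** 2

    def fuse_maps(maps, T=0.5):
        # Rank-weighted fusion of K maps (Eqs. 2 and 4). Maps whose efficiency
        # is too low relative to the best one are discarded; the others are
        # weighted by their rank i/K and summed. T = 0.5 is a placeholder value.
        K = len(maps)
        ecs = np.array([efficiency(m) for m in maps])
        order = np.argsort(ecs)          # ascending: weakest first, best last
        best = ecs[order[-1]]
        fused = np.zeros_like(maps[0], dtype=float)
        for rank, idx in enumerate(order, start=1):  # best map gets weight K/K = 1
            if best > 0 and ecs[idx] / best >= T:
                fused += (rank / K) * maps[idx]
        return fused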

C. Color Space Influence

Our method estimates saliency using rarity on the color feature only. The accuracy of this feature is very important and our method is strongly influenced by the choice of the color space representation. If we observe the saliency maps for the different color modes independently, we can see that the performance is highly dependent on the selected mode, ranging from excellent to poor, but in all cases at least one map provides good performance. For this reason we have chosen to apply the rarity on several color spaces and merge the different rarity maps.


D. Final Saliency Map

Finally, as in this case we work with an organized point cloud, we apply a centered Gaussian filter to represent the central preference that people exhibit in images [12]. This centered preference also makes sense in the context of robotics, for example for obstacle avoidance, where one wants to correct the path of a robot to avoid collisions with objects in front of it.
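For a given organized viewpoint, this center bias amounts to multiplying the 2D view of the saliency map by a centered 2D Gaussian; a minimal sketch follows, in which the standard deviation is an illustrative choice, not a value specified in the paper.

    import numpy as np

    def centered_gaussian(height, width, sigma_ratio=0.33):
        # 2D Gaussian weight map centered on the image grid, modelling the human
        # central preference [12]; sigma_ratio is an illustrative choice.
        ys, xs = np.mgrid[0:height, 0:width]
        cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
        sy, sx = sigma_ratio * height, sigma_ratio * width
        return np.exp(-0.5 * (((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2))

    # Example: weight a 2D view of the saliency map for a 640x480 Kinect viewpoint.
    # saliency_topdown = saliency_2d * centered_gaussian(480, 640)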

IV. VALIDATION

A. Database

The database that we used to validate our method was published by [1]. It contains 80 shots obtained using a Microsoft Kinect sensor mounted on a Willow Garage PR2 robot. The database consists of RGB images, depth maps and point clouds associated with pixel-level ground truth segmentation masks. The 80 scenes are very complex both in terms of number and shape of objects, colors and illumination, but also in terms of depth differences. This database is oriented toward robot navigation and object manipulation. Indeed, many objects have little depth difference with the objects surrounding them.

B. Metric

Several measures, like the Area Under the ROC Curve (AUROC) and Precision-Recall, have been suggested to evaluate the accuracy of salient object detection maps. However, as shown in [18], these commonly used measures do not always provide a reliable evaluation. The authors identify three causes of inaccurate evaluation: 1) the interpolation flaw, 2) the dependency flaw and 3) the equal-importance flaw. By amending these three assumptions, they propose a new reliable measure called the $F^w_\beta$-measure, defined in Eq. (5).

$$F^w_\beta = (1+\beta^2)\,\frac{Precision^w \cdot Recall^w}{\beta^2 \cdot Precision^w + Recall^w} \qquad (5)$$

with

$$Precision^w = \frac{TP^w}{TP^w + FP^w}, \qquad Recall^w = \frac{TP^w}{TP^w + FN^w}.$$

A weighted error has been defined to resolve these flaws. This metric provides a better evaluation than previous measures. We use it to validate SuperRare3D on the database in order to be as fair and precise as possible.
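Given the weighted counts defined in [18] (their distance-based weighting of errors is not reproduced here), Eq. (5) itself is straightforward to compute; in the sketch below, $\beta^2 = 1$ is assumed as a common default rather than a value stated in the paper.

    def weighted_f_beta(tp_w, fp_w, fn_w, beta2=1.0):
        # Weighted F-measure of [18] (Eq. 5), computed from already-weighted
        # true positives, false positives and false negatives.
        precision_w = tp_w / (tp_w + fp_w) if (tp_w + fp_w) > 0 else 0.0
        recall_w = tp_w / (tp_w + fn_w) if (tp_w + fn_w) > 0 else 0.0
        denom = beta2 * precision_w + recall_w
        return (1.0 + beta2) * precision_w * recall_w / denom if denom > 0 else 0.0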

C. Method

We validated our SuperRare3D model (called SR3D) in two steps. First, we computed a 2D saliency map as a view of the 3D saliency map (2D projection). Then, we compared SR3D to 5 other depth-extended (2.5D) models. The weighted F-measure is used to compare SR3D with the 2.5D saliency methods given a pre-segmented ground truth.
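For the organized Kinect clouds of this database, the 2D projection can be obtained by writing each point's saliency back to its pixel position in the 640x480 grid; a minimal sketch, assuming one point per pixel and illustrative variable names:

    import numpy as np

    def project_saliency(point_labels, supervoxel_saliency, height=480, width=640):
        # 2D view of the 3D saliency map for an organized cloud: point k of the
        # cloud corresponds to pixel (k // width, k % width) of the Kinect image.
        # point_labels: supervoxel index of each point;
        # supervoxel_saliency: one saliency value per supervoxel.
        per_point = np.asarray(supervoxel_saliency)[np.asarray(point_labels)]
        return per_point.reshape(height, width)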

Fig. 4. Comparison of our model with 5 state-of-the-art 2.5D saliency models on the database [1]. SR3D outperforms the 2.5D models.

D. Qualitative Results

Our full-3D model (SR3D) provides a 3D viewpoint-independent saliency map of any kind of organized or unorganized point cloud. Figure 3 shows 2 examples of results for 2 different types of point clouds. The first column shows the results on a single Kinect point cloud. The second column shows an example of a result on a point cloud obtained using a co-calibrated laser scanner. The first row shows the input colored point clouds and the second row the full-3D bottom-up viewpoint-independent saliency maps. This figure shows two crucial advantages of the proposed model over any existing 2D or 2.5D saliency model: (1) the ability to work on any kind of structured or unstructured point cloud and (2) the ability to provide viewpoint-free 3D saliency maps which can be adapted to any given viewpoint.

E. Quantitative Results

Figure 4 shows the results of the validation. Concerning the comparison with the 2.5D models, SR3D outperforms all the other models.

V. CONCLUSIONS

In this article, we proposed a new full-3D saliency model based on supervoxel extraction and color rarity for analyzing organized or unorganized point clouds. The major contribution is the use of supervoxels, which lets us estimate the saliency of large point clouds with a simple and efficient saliency mechanism. Our method has been validated against multiple recent saliency models using the weighted F-measure, a recent optimized evaluation method. SR3D easily exceeds the depth-extended 2D saliency models. These results are achieved with good computational performance (for example, with one level of supervoxel decomposition, a Kinect cloud can be processed in less than 1.5 seconds on a recent Intel Core i7), using only the color feature, which indicates an important potential for performance increase when adding depth, voxel orientation and other 3D features. The method provides viewpoint-independent full-3D saliency maps which can be inspected from different viewpoints by adapting the centered Gaussian and depth to a given viewpoint.


REFERENCES

[1] Arridhana Ciptadi, Tucker Hermans, and James Rehg, "An In Depth View of Saliency," Proceedings of the British Machine Vision Conference 2013, pp. 112.1–112.11, 2013.

[2] Karthik Desingh, K. Madhava Krishna, Deepu Rajan, and C. V. Jawahar, "Depth Really Matters: Improving Visual Salient Region Detection with Depth," Proceedings of the British Machine Vision Conference 2013, pp. 98.1–98.11, 2013.

[3] Congyan Lang, Tam V. Nguyen, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Shuicheng Yan, "Depth matters: Influence of depth cues on visual saliency," Lecture Notes in Computer Science, vol. 7573 LNCS, pp. 101–115, 2012.

[4] Nicolas Riche, Matei Mancas, Bernard Gosselin, and Thierry Dutoit, "3D Saliency for abnormal motion selection: The role of the depth map," Computer Vision Systems, 2011.

[5] Chang Ha Lee, Amitabh Varshney, and David W. Jacobs, "Mesh saliency," ACM Transactions on Graphics, vol. 24, no. 3, p. 659, July 2005.

[6] Yu-Shen Liu, Min Liu, Daisuke Kihara, and Karthik Ramani, "Salient critical points for meshes," Proceedings of the 2007 ACM Symposium on Solid and Physical Modeling, pp. 277–282, 2007.

[7] Rui Wang, Cuiyun Gao, Junli Chen, and Wanggen Wan, "Saliency map in 3D point cloud and its simplification application," Journal of Computational Information Systems, vol. 10, pp. 3553–3560, 2014.

[8] Elizabeth Shtrom, George Leifman, and Ayellet Tal, "Saliency Detection in Large Point Sets," 2013 IEEE International Conference on Computer Vision, pp. 3591–3598, Dec. 2013.

[9] Nicolas Riche, Matei Mancas, Matthieu Duvinage, Makiese Mibulumukini, Bernard Gosselin, and Thierry Dutoit, "RARE2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis," Signal Processing: Image Communication, vol. 28, no. 6, pp. 642–658, 2013.

[10] N. Ouerhani and H. Hugli, "Computing visual attention from scene depth," in Proceedings of the 15th International Conference on Pattern Recognition, 2000, vol. 1, pp. 375–378.

[11] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.

[12] Tilke Judd, Fredo Durand, and Antonio Torralba, "A benchmark of computational models of saliency to predict human fixations," 2012.

[13] Stas Goferman, Lihi Zelnik-Manor, and Ayellet Tal, "Context-aware saliency detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 1915–1926, Oct. 2012.

[14] Julien Leroy, Nicolas Riche, Matei Mancas, Bernard Gosselin, and Thierry Dutoit, "SuperRare: an object-oriented saliency algorithm based on superpixels rarity," 2013.

[15] Jeremie Papon, Alexey Abramov, Markus Schoeler, and Florentin Worgotter, "Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds," 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2027–2034, June 2013.

[16] Dorit Borrmann, Jan Elseberg, HamidReza Houshiar, and Andreas Nuchter, "Robotic 3D scan repository."

[17] Laurent Itti and Christof Koch, "A comparison of feature combination strategies for saliency-based visual attention systems," Journal of Electronic Imaging, vol. 10, pp. 161–169, 1999.

[18] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal, "How to evaluate foreground maps?," IEEE Conference on Computer Vision and Pattern Recognition, 2014.