
Label Transfer exploiting three-dimensional structure for Semantic Segmentation

Valeria Garro
CS, University of Verona
Strada Le Grazie 15, 37134 Verona (Italy)
[email protected]

Andrea Fusiello∗
DIEGM, University of Udine
Via Delle Scienze 206, 33100 Udine (Italy)
[email protected]

Silvio Savarese
EECS, University of Michigan
1301 Beal Avenue, Ann Arbor, MI 48109-2122, US
[email protected]

ABSTRACT
This paper deals with the problem of computing a semantic segmentation of an image via label transfer from an already labeled image set. In particular, it proposes a method that takes advantage of sparse 3D structure to infer the category of superpixels in the novel image. The label assignment is computed by a Markov random field that has the superpixels of the image as nodes. The data term combines labeling proposals from the appearance of the superpixel and from the 3D structure, while the pairwise term incorporates spatial context, both in the image and in 3D space. Exploratory results indicate that 3D structure, albeit sparse, improves the process of label transfer.

Categories and Subject Descriptors
I.4.6 [Image Processing And Computer Vision]: Segmentation—Pixel classification

Keywords
Image Parsing, Segmentation, Labeling, Image Understanding, Markov Random Fields, Structure From Motion

1. INTRODUCTION
Semantic image segmentation, also known as image parsing or labeling image regions with their categories, has recently been tackled as a label transfer problem, which reduces the inference problem for a new image to the problem of matching it to an existing set of labeled images. Whereas this matching is usually accomplished by exploiting local similarity between images [11, 17, 15], in this paper we investigate how sparse 3D information coming from Structure from Motion can improve the transfer. What motivates our work is that we expect 3D structure to be a powerful cue for matching, while its computation is becoming a "commodity" thanks to the availability of efficient and robust implementations [14, 7, 5]. In particular, in this work we concentrate on urban environment applications, which have received much attention both in the segmentation [11, 15, 16] and in the Structure from Motion literature [2].

∗ Part of this work was carried out while A.F. was with the University of Verona.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Mirage 2013, Berlin, Germany.
Copyright 2013 ACM 978-1-4503-2023-8/13/06 ...$15.00.

[2D approach] Previous work on annotation transfer comes from the field of object recognition, where some existing techniques have been modified to deal specifically with urban outdoor scenarios. To cope with the massive dataset dimension, the authors of [11] present a method that first involves a retrieval step on a large database of annotated images using a modified version of SIFT flow [10]. The subsequent steps are applied only to a small subset of images presenting a similar structure alignment. This system integrates multiple cues in a Markov random field framework to segment and recognize the query image. Even though the algorithm gives accurate results, the matching time for a pair of images at a very small resolution (256 × 256) is around 30 seconds, which is still not suitable for real-time applications.

Zhang et al. [17] proposed a similar procedure that involves retrieving multiple sets of similar annotated images, each of which can cover all semantic categories of the query image. The dense correspondences between these sets of images and the query are then computed at the superpixel level to increase computational efficiency. The matching scheme, called KNN-MRF, does not rely only on the appearance features of each superpixel but also considers the spatial relation between pairs of neighboring superpixels, modeling it as a Markov random field. After establishing the correspondences, a pruning procedure is applied in order to remove the semantically incorrect matches using a trained classification model.

Another interesting work has been proposed in [15]. The authors present a non-parametric, scalable approach to image labeling at the superpixel level. For each superpixel of the dataset images, a large number of disparate features are extracted describing its appearance. Unlike [17], it requires no training; it also combines semantic (e.g. sky, road, building) and geometric (e.g. vertical, horizontal) labeling under the assumption that each semantic label is associated with a unique geometric label.

Recently, [6] proposed a solution that leverages a graph of dense overlapping patch correspondences across a large dataset of images, avoiding the need for exhaustive pairwise comparisons.


[3D approach] An early attempt at label transfer involving 3D points coming from a Structure from Motion reconstruction can be found in [13]. In that work, a user's annotation in one image, in the form of a rectangle with text, is transferred to the other images of the Structure from Motion dataset simply by retrieving the 3D points whose 2D projections fall inside the specified rectangle and then reprojecting them onto the other images. The authors apply an additional check involving visibility and scale information that avoids incorrect reprojections, but the label transfer is still rough, penalized by tight visibility constraints, and restricted to the set of images involved in the Structure from Motion algorithm.

We propose a method that falls in between these two approaches, exploiting the additional 3D data to obtain a precision that improves on purely image-based approaches. To the best of our knowledge, our work is the first attempt to combine appearance-based label transfer with sparse 3D structure.

The rest of the paper is organized as follows. The next section gives an overview of the algorithm; Section 3 describes the pre-computation step of superpixel and feature extraction, and Section 4 explains two different techniques for retrieving a subset of images in order to limit the main computation to a more specific and reliable dataset. In Section 5 we describe in detail how we map our problem into a Markov Random Field framework; Section 6 describes the experiments, and conclusions are drawn in Section 7.

2. METHOD OVERVIEW
Our method integrates information from the appearance of an image, captured in an urban environment, with the three-dimensional structure of the scene in order to perform the label transfer. In a nutshell, the query image is first segmented into superpixels; then both information sources – namely, image appearance and 3D structure – propose a labeling for each superpixel; these proposals are combined and then relaxed with an MRF to include spatial context.

It is assumed that a set of images D is available, capturing a specific urban environment, together with a sparse cloud of 3D points S representing the structure of the scene depicted in D, as commonly produced by a Structure from Motion pipeline like [14]. Moreover, a subset of these images A ⊂ D is already labeled, i.e., it has been segmented into small, regular, and coherent regions, called superpixels, and each superpixel has been assigned to a specific semantic class. Additional available data, obtained by the Structure from Motion algorithm, are the camera poses (the orientation and the 3D position of the camera center) of each image belonging to D.

The goal of our work is to label the images belonging to the set D \ A, or another external image not included in D but captured in the same scene environment. In the latter case, the procedure should rely on a localization system, as described in [8], that efficiently computes the external camera parameters (orientation and camera center) of the image with respect to the existing reconstruction. From this point forward, let us call Iq, the query image, the one that is going to be labeled.

3. SUPERPIXELS PRECOMPUTATION
As a pre-processing step, computed only once in advance, the labeled images in A are segmented into superpixels using the Simple Linear Iterative Clustering algorithm (SLIC) [1], which starts with a regular grid of cluster centers and then locally clusters pixels in the combined five-dimensional color and image plane space. The extracted superpixels are characterized by a compact and nearly uniform shape, as shown in Figure 1.

Figure 1: Example of superpixel extraction.
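As an illustration, superpixels of this kind can be obtained with the SLIC implementation available in scikit-image. This is a minimal sketch standing in for the authors' own SLIC code; the file name and the compactness value are assumptions, while the segment count matches the roughly 600 superpixels per image reported in Section 6.

```python
# pip install scikit-image
from skimage import io
from skimage.segmentation import slic

image = io.imread("query.jpg")  # hypothetical input image
# Cluster pixels in the combined five-dimensional color + (x, y) space;
# compactness trades color homogeneity against shape regularity.
segments = slic(image, n_segments=600, compactness=10, start_label=0)
```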

For each superpixel the algorithm extracts three features related to its location in the image and its color and texture appearance (a minimal sketch of these descriptors is given after the list):

• the top height, linked to the relative "location" of the superpixel in the image (e.g., in urban scenarios "sky" superpixels will more likely appear at the top of the image and "ground" superpixels at the bottom);

• the color histogram in the log-chromaticity space, capturing the color appearance of the superpixel;

• a histogram of occurrences of the SURF feature descriptors already extracted from the images during the Structure from Motion procedure, restricted to the keypoints that fall inside the superpixel.
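The sketch below illustrates the three descriptors. The data layout (a boolean mask per superpixel, a precomputed SURF visual-word codebook) is an assumption made for illustration, not the paper's actual interface.

```python
import numpy as np

def top_height(mask):
    # Normalized row of the superpixel's topmost pixel (0 = top of image).
    return np.nonzero(mask)[0].min() / float(mask.shape[0])

def log_chroma_hist(rgb, mask, bins=8, eps=1e-6):
    # 2D histogram of log-chromaticity values (cf. Eq. 7) over the superpixel.
    r, g, b = rgb[mask, 0] + eps, rgb[mask, 1] + eps, rgb[mask, 2] + eps
    h, _, _ = np.histogram2d(np.log(r / g), np.log(b / g),
                             bins=bins, range=[[-2, 2], [-2, 2]])
    return (h / max(h.sum(), 1)).ravel()

def surf_bow(descriptors, codebook):
    # Histogram of visual-word occurrences for the SURF descriptors of the
    # keypoints falling inside the superpixel, quantized against a codebook.
    if len(descriptors) == 0:
        return np.zeros(len(codebook))
    words = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :],
                           axis=2).argmin(axis=1)
    return np.bincount(words, minlength=len(codebook)).astype(float)
```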

The number of features is much smaller than in [15], in which 4 global and 13 superpixel features are extracted. The main reasons are that:

i) we do not need to compute global features, which are mainly related to the image retrieval step, because the algorithm relies only on the 3D structure for the computation of the "working reference subset" of most similar images. Section 4 describes the procedure in detail.

ii) since superpixels have a compact, nearly uniform shape, there is no need to describe shape and location properties in detail: a single feature, in our case the top height, can sufficiently characterize the superpixel.

4. ACTIVE SET
Dealing with a potentially large input dataset of images D taken in a huge urban environment, such as district areas or entire cities [4], we need an efficient method to extract the effective subset of images Aq ⊂ A that capture the same scene portion as Iq (which we call the "active" set). We now describe the two possible scenarios already mentioned in Section 2.

If Iq ∈ D we can define Aq by exploiting the visibility of the 3D points, since Iq shares the visibility of a common portion of the 3D point cloud. Let us call Sq ⊂ S the set of 3D points visible from the query image Iq; we reproject the 3D points in Sq onto the images belonging to A and define Aq as the subset of A in which the projection gives valid values (i.e., at least a subset of Sq falls inside the camera's view frustum).
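This visibility test can be sketched as follows, assuming the SfM output provides a 3×4 projection matrix per labeled image; ranking the images by the number of points that reproject inside them (to keep the "at most 10" of Section 6) and the helper name are our assumptions.

```python
import numpy as np

def active_set(S_q, cameras, image_sizes, k=10):
    # Keep the k labeled images into which most of the query's visible
    # 3D points reproject in front of the camera and inside the bounds.
    X = np.hstack([S_q, np.ones((len(S_q), 1))])   # homogeneous 3D points
    scores = []
    for P, (w, h) in zip(cameras, image_sizes):
        x = (P @ X.T).T
        z = np.where(np.abs(x[:, 2:3]) > 1e-9, x[:, 2:3], np.nan)
        uv = x[:, :2] / z
        inside = (x[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                               & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        scores.append(int(inside.sum()))
    return np.argsort(scores)[::-1][:k]
```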

If Iq ∉ D the image should first be localized, as already mentioned in Section 2. The first stage of a typical localization system consists of a retrieval procedure for the most similar images Dq, which is done using image information with a Bag-of-Words approach [12] and can be refined by also involving a 3D structure constraint [8]. Aq is then defined as the intersection Dq ∩ A.

5. MRF INFERENCE
The label assignment of Iq is computed by a Markov Random Field (MRF) that has the superpixels of Iq as nodes. An edge is set between two nodes if the corresponding superpixels are adjacent. The set of superpixels {si} is denoted by V, while E represents the set of edges between neighboring nodes {(si, sj)}.
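Given the SLIC segmentation map, the edge set E can be built by scanning neighboring pixel pairs; a small sketch (our own construction, not code from the paper):

```python
import numpy as np

def superpixel_edges(seg):
    # seg: 2D array assigning a superpixel id to every pixel. Two nodes are
    # connected iff their superpixels share a horizontal or vertical border.
    edges = set()
    for a, b in [(seg[:, :-1], seg[:, 1:]), (seg[:-1, :], seg[1:, :])]:
        diff = a != b
        for u, v in zip(a[diff], b[diff]):
            edges.add((int(min(u, v)), int(max(u, v))))
    return edges
```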

5.1 Unary Term
The unary term of the MRF encodes labeling proposals coming from the actual data, both in terms of the appearance of superpixels and of the 3D structure.

5.1.1 Appearance based
For each superpixel si of Iq we assign a score to each label l based on image appearance, as in [15]. The probability P_a(f_i^k | l) of observing f_i^k, the k-th feature of superpixel si, given the label l, can be estimated empirically from the frequency of features with the given label in the neighborhood of f_i^k (in feature space). Specifically, let N_i^k denote the set of all superpixels in Aq whose k-th feature distance from f_i^k is below a fixed threshold. Then:

$$P_a(f_i^k \mid l) = \frac{1 + n(l, N_i^k)}{1 + n(l, A_q)} \qquad (1)$$

where n(l, Z) denotes the number of superpixels in set Z associated with label l.

Making the customary assumption that the extracted features are independent of each other given the label, the likelihood ratio for label l and superpixel si is:

$$\Lambda_a(l, s_i) = \frac{P(s_i \mid l)}{P(s_i \mid \bar{l})} = \prod_k \frac{P_a(f_i^k \mid l)}{P_a(f_i^k \mid \bar{l})} \qquad (2)$$

where l̄ is the set of all labels excluding l, and P_a(f_i^k | l̄) is computed likewise.
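A sketch of Eqs. (1)-(2) in log form follows; the way the neighborhoods N_i^k are passed in (as lists of labels of the near superpixels in the active set) is a hypothetical data layout.

```python
import numpy as np

def appearance_log_ratio(label, neighbors_per_feature, active_labels):
    # neighbors_per_feature[k]: labels of the active-set superpixels whose
    # k-th feature is within the threshold of the query superpixel's.
    n_act = len(active_labels)
    n_l_act = sum(1 for x in active_labels if x == label)
    log_ratio = 0.0
    for neigh in neighbors_per_feature:
        n_l = sum(1 for x in neigh if x == label)
        p_l = (1 + n_l) / (1 + n_l_act)                        # Eq. (1)
        p_not = (1 + len(neigh) - n_l) / (1 + n_act - n_l_act) # same, for l-bar
        log_ratio += np.log(p_l) - np.log(p_not)               # log of Eq. (2)
    return log_ratio
```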

5.1.2 Structure based
In order to exploit the additional information given by the 3D structure, we establish a relationship φ that maps superpixels of Iq to superpixels in Aq via the 3D structure (details in Sec. 5.1.3). Thanks to φ we can compute Π(si), the set of points of Sq that are projected onto superpixel si. Points in Π(si) carry forward the label of the associated superpixel in Aq, so every point in Π(si) casts a vote for the label of si; the probability of si having label l (according to the structure) is estimated by counting as:

$$P_s(l \mid s_i) = \frac{1 + n(l, \Pi(s_i))}{1 + n(*, \Pi(s_i))} \qquad (3)$$

where n(*, Π(si)) is simply the cardinality of Π(si).

We combine the two terms Λ_a(l, si) and P_s(l | si), after converting them to an energy, to obtain the unary (data) term:

$$E_d(s_i) = -\log \Lambda_a(l, s_i) - \alpha \log P_s(l \mid s_i). \qquad (4)$$

The α coefficient modulates the relative influence of appearance and geometry, and was set to α = 0.8 in our experiments.
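Combining the two cues into the unary energy of Eq. (4) is then straightforward; a minimal sketch with the paper's α = 0.8:

```python
import numpy as np

def unary_energy(log_ratio_a, votes_for_label, total_votes, alpha=0.8):
    # Eq. (3) by add-one-smoothed vote counting, plugged into Eq. (4).
    p_s = (1 + votes_for_label) / (1 + total_votes)
    return -log_ratio_a - alpha * np.log(p_s)
```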

5.1.3 Using the structure
In this section we discuss the method by which we establish a relationship between a superpixel in Aq and a superpixel of Iq via the 3D structure. The first step (performed once and for all) is to compute the visibility of the 3D points with respect to the superpixels in Aq. This is done with ray-casting in a discretized volumetric space, where a cell (or voxel) is deemed occupied if more than a given number of points fall inside it. Marching along the ray from the 3D point x to the center of the superpixel s (back-to-front), s is added to the visibility set of x if all the cells on the path are free (see Figure 2). Then for each superpixel s we compute Π(s), the set of 3D points that project onto s, according to the visibility. The same procedure is applied to the superpixels in Iq, and in this way a map φ from a superpixel s ∈ Iq to a set (possibly empty) of superpixels of Aq is established:

$$\phi(s) = \{ s_i \in A_q \mid \Pi(s_i) \cap \Pi(s) \neq \emptyset \} \qquad (5)$$

Figure 2: Illustration of the ray-casting procedure. The path from the corner of the church to the camera position (cross) is obstructed by the statue, so the ray tracing (light-blue squares) stops there.
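A sketch of the visibility test and of the map φ of Eq. (5) follows. The dense boolean occupancy grid, its origin and voxel size, and the dictionary layout of the Π sets are assumptions made for illustration (the grid is assumed to cover the whole scene).

```python
import numpy as np

def visible(point, cam_center, occupancy, origin, voxel_size):
    # March from the 3D point toward the camera through the voxel grid;
    # the point sees the camera iff every voxel on the path is free.
    direction = cam_center - point
    n_steps = max(int(np.ceil(np.linalg.norm(direction) / voxel_size)), 1)
    for t in np.linspace(0.0, 1.0, n_steps, endpoint=False)[1:]:
        cell = tuple(((point + t * direction - origin) // voxel_size).astype(int))
        if occupancy[cell]:
            return False            # an occupied voxel obstructs the ray
    return True

def phi(Pi_query, Pi_active):
    # Eq. (5): active-set superpixels whose visible 3D point sets (given
    # as sets of point ids) intersect those of the query superpixel.
    return [j for j, pts in Pi_active.items() if pts & Pi_query]
```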

This geometric way of handling occlusions fails if the occluder has not been reconstructed (no 3D points are attached to it). In order to cope with this problem we also perform a photoconsistency check between superpixels that have been associated via φ. In other words, superpixel s must be photoconsistent (i.e., have similar appearance) with the superpixels in φ(s). The photoconsistency of two superpixels si ∈ Iq and sj ∈ φ(si) is evaluated as the chi-square distance between the two-dimensional histograms of the log-chromaticity representation of the superpixels. In formula:

$$\chi^2(h_i, h_j) = \frac{1}{2} \sum_m \frac{(h_i(m) - h_j(m))^2}{h_i(m) + h_j(m)} \qquad (6)$$

where h_i is the color histogram of si transformed into the log-chromaticity space and normalized. The log-chromaticity transformation [3] of an RGB image is defined as:

$$\{R, G, B\} \rightarrow \{\log(R/G), \log(B/G)\}. \qquad (7)$$

After transforming into the log-chromaticity space, two patches representing the same surface under different illumination (in color and intensity) are related by a linear transform, which we factor out by subtracting the mean and dividing by the standard deviation of the log-chromaticity values. After such normalization, two patches representing the same surface should have identical values.
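The photoconsistency check can be sketched as below; the histogramming itself is as in Section 3, and the small epsilons are our own numerical safeguards.

```python
import numpy as np

def chi_square(h_i, h_j, eps=1e-10):
    # Chi-square distance of Eq. (6) between two normalized histograms.
    return 0.5 * np.sum((h_i - h_j) ** 2 / (h_i + h_j + eps))

def normalized_log_chroma(rgb_pixels, eps=1e-6):
    # Log-chromaticity (Eq. 7) standardized to zero mean and unit standard
    # deviation, factoring out a global linear illumination change.
    r, g, b = (rgb_pixels[:, 0] + eps, rgb_pixels[:, 1] + eps,
               rgb_pixels[:, 2] + eps)
    lc = np.stack([np.log(r / g), np.log(b / g)], axis=1)
    return (lc - lc.mean(axis=0)) / (lc.std(axis=0) + eps)
```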

5.2 Pairwise Term
The pairwise term of the MRF enforces the spatial contextual constraint on the label assignment. It differs from the classical constant Potts penalty in that the cost is modulated by a term which is inversely proportional to distance in 3D space, the rationale being that little or no cost is paid for assigning different labels to superpixels that are associated with 3D points that are far apart in space:

$$\phi(l_a, l_b) = \beta\, w_d\, \delta(l_a \neq l_b) \qquad (8)$$

The distance term w_d is defined as:

$$w_d(s_i, s_j) = \begin{cases} e^{-\mu \| p_i - p_j \|} & \text{if } \Pi(s_i) \neq \emptyset \,\wedge\, \Pi(s_j) \neq \emptyset \\ 1 & \text{otherwise} \end{cases} \qquad (9)$$

where p_i is the centroid of Π(si), and µ is a rate of decay that depends on the scale of the 3D structure. In our experiments we set β and µ to 10 and 4, respectively. The resulting smoothness term is:

$$E_s = \sum_{(s_i, s_j) \in E} \phi_{s_i, s_j} \qquad (10)$$
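In code, the distance-modulated Potts penalty of Eqs. (8)-(9) reads as follows (a sketch with the paper's β = 10 and µ = 4; superpixels without 3D points signal this with None):

```python
import numpy as np

def pairwise_cost(label_a, label_b, p_i, p_j, beta=10.0, mu=4.0):
    # p_i, p_j: centroids of the 3D points projecting onto the two
    # superpixels, or None when a superpixel has no associated points.
    if label_a == label_b:
        return 0.0                             # delta(la != lb) = 0
    if p_i is None or p_j is None:
        return beta                            # plain Potts fallback, w_d = 1
    return beta * np.exp(-mu * np.linalg.norm(p_i - p_j))
```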

5.3 Inference
The total energy cost function is the sum of the unary terms (Eq. 4) and pairwise terms (Eq. 10):

$$E(s) = \underbrace{\sum_{s_i \in V} \big( -\log \Lambda_a(l, s_i) - \alpha \log P_s(l \mid s_i) \big)}_{\text{unary term}} \; + \; \underbrace{\sum_{(s_i, s_j) \in E} \phi_{s_i, s_j}}_{\text{pairwise term}}$$

In order to minimize this submodular function we decided to apply the tree-reweighted belief propagation algorithm [9], as its implementation lends itself well to code parallelization (in CUDA as well) in order to achieve high performance. However, this implementation requires a regular connectivity. While at the onset of SLIC every superpixel has exactly four neighbors, during the evolution of the clusters this connectivity may in principle vary, although in practice it is mostly preserved. Therefore we decided to force the connectivity to remain unchanged.
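As a self-contained stand-in for tree-reweighted message passing [9], the sketch below minimizes the same energy with iterated conditional modes (ICM), a much simpler coordinate-descent scheme; it is not the paper's solver and generally finds weaker local minima.

```python
import numpy as np

def icm(unary, edges, pairwise, n_iters=10):
    # unary: (n_nodes, n_labels) energies; edges: iterable of (i, j) pairs;
    # pairwise(la, lb, i, j): cost of assigning la, lb to neighbors i, j.
    n_nodes, n_labels = unary.shape
    labels = unary.argmin(axis=1)
    neighbors = [[] for _ in range(n_nodes)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(n_iters):
        changed = False
        for i in range(n_nodes):
            costs = unary[i].astype(float)
            for l in range(n_labels):
                costs[l] += sum(pairwise(l, labels[j], i, j)
                                for j in neighbors[i])
            best = int(costs.argmin())
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:
            break
    return labels
```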

6. EVALUATION
Since no existing benchmark databases are available that meet our requirements, we created a new labeled dataset of 50 images with the related 3D point cloud reconstruction, obtained with a Structure from Motion algorithm [5] whose implementation is available online¹.

Figure 3: Top: sample images from the dataset. Bottom: a view of the 3D structure.

The depicted scene is a public square in front of a church, with a horse statue occluding the background in some particular views². This scenario has been chosen in order to fully evaluate the potential of the 3D data. The cardinality of S, the set of 3D points, in this dataset is about 70k; a view of the 3D structure can be seen at the bottom of Figure 3, while some sample images are shown at the top. The original image resolution, used in the Structure from Motion algorithm, is 2592 × 1944 pixels, but for the labeling procedure the images have been scaled down by a factor of four in order to speed up the computation and to limit the number of superpixels extracted. From each low-resolution image (648 × 486) an average of 600 superpixels has been extracted.

The ground truth data have been obtained by manually labeling each image of the dataset (in this case D = A), assigning to each superpixel one of the following labels: base statue, statue's pedestal, statue's horse, scaffolding cover, woodboard, church, church's cupola, ground, building, sky. Figure 4 shows an example of ground truth labeling; the portions of the image shown in grayscale are associated with no label.

¹ www.3dflow.net
² The dataset will be available online.


We tested our system with a leave-one-out approach: one at a time, each image has been removed from the dataset. Since Iq ∈ D, the experiment falls in the first of the two retrieval scenarios explained in Section 4: we reproject onto A \ Iq the 3D points visible from Iq in order to retrieve a subset Aq of at most 10 most similar images that will be used for the labeling procedure. In this way, in the case of a dataset of more considerable size, we do not have to employ the feature data of the entire dataset during the nearest neighbour extraction, preserving the system efficiency.

Figure 4: Example of hand-made labeling.

We compared our proposed method to a partial implementation that ignores the 3D data contributions and is similar in spirit to [15]. The performance of our system is presented in terms of per-pixel classification rate: our proposed method achieves 82.6%, whereas the restricted version exploiting only 2D data reaches 78.1%. As shown in Figure 5, the label distribution in the dataset is far from uniform, so we also report the per-pixel classification rate for each label (see Figure 6). The average per-class rate is 59.4% for the complete version of the proposed method and 49.7% for the 2D restricted version.

Figure 5: Superpixel label frequencies in the dataset.

Figure 6: Per-class classification rate.

The addition of the 3D data in the MRF clearly improves the labeling results, particularly in urban scenarios where some labels can be difficult to discern, such as different buildings with similar color and texture appearance (e.g. "base statue" and "church"). This is also clearly displayed in Figure 7, which shows a qualitative example of two labeling results related to the first two images of Fig. 3. For some labels there is no strong gap in the classification rate; this is due to the lack of 3D data: there can be no 3D reconstruction of "ground" and "sky", and the "scaffolding cover" corresponds to textureless regions. In these regions of the images the Structure from Motion pipeline did not extract any SURF feature descriptors and, as a consequence, did not reconstruct a sufficient number of 3D points, as shown at the bottom of Figure 3.

The time cost of labeling a query image is summarized in Table 1, broken down by each step of the proposed system.

Table 1: Average time cost
Superpixel extraction   2 sec
Feature extraction     23 sec
Visibility check       25 sec
Label inference         1 sec

The current version of the proposed system is implemented in MATLAB with no parallelization or optimized data structures; the computational bottlenecks are the feature extraction and visibility check stages, which could be sped up by exploiting the processing power of a graphics card.

7. DISCUSSION
This paper has investigated leveraging 3D structure in the process of label transfer. The method that has been presented takes advantage of i) already labeled images, and ii) sparse 3D structure to infer the category of superpixels in a novel image. Due to the laboriousness of creating annotated datasets with 3D structure, the experiments that have been reported are only exploratory, but they clearly indicate that 3D structure, even if sparse, can improve the process of label transfer.


Figure 7: Two examples of labeling results. The improvement brought by the 3D structure can be appreciated by comparing the left and right images of the second and third rows.


Future work will be aimed at testing the method on larger and more diverse datasets, with a continued focus on scalability.

Acknowledgments
Thanks to Roberto Toldo from 3Dflow srl for providing the structure from motion code and other support.

8. REFERENCES
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In Proceedings of the European Conference on Computer Vision, 2008.
[3] G. D. Finlayson, S. D. Hordley, C. Lu, and M. S. Drew. Removing shadows from images. In Proceedings of the European Conference on Computer Vision, pages 823–836, 2002.
[4] J. Frahm, P. Fite Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In Proceedings of the European Conference on Computer Vision, pages IV: 368–381, 2010.
[5] R. Gherardi, M. Farenzena, and A. Fusiello. Improving the efficiency of hierarchical structure-and-motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[6] S. Gould and Y. Zhang. PatchMatchGraph: Building a graph of dense patch correspondences for label transfer. In Proceedings of the European Conference on Computer Vision, 2012.
[7] A. Irschara, C. Zach, and H. Bischof. Towards wiki-based dense city modeling. In Proceedings of the International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[8] A. Irschara, C. Zach, J. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 2599–2606, 2009.
[9] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, October 2006.
[10] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT flow: Dense correspondence across different scenes. In Proceedings of the European Conference on Computer Vision, pages 28–42, Berlin, Heidelberg, 2008. Springer-Verlag.
[11] C. Liu, J. Yuen, and A. B. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 1972–1979, 2009.
[12] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, volume 2, pages 1470–1477, October 2003.
[13] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In SIGGRAPH Conference Proceedings, pages 835–846, New York, NY, USA, 2006. ACM Press.
[14] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189–210, November 2008.
[15] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. In Proceedings of the European Conference on Computer Vision, pages 352–365, Berlin, Heidelberg, 2010. Springer-Verlag.
[16] J. Xiao and L. Quan. Multiple view semantic segmentation for street view images. In Proceedings of the International Conference on Computer Vision, 2009.
[17] H. Zhang, J. Xiao, and L. Quan. Supervised label transfer for semantic segmentation of street scenes. In Proceedings of the European Conference on Computer Vision, pages 561–574, Berlin, Heidelberg, 2010. Springer-Verlag.