

EARTH MOVER DISTANCE ON SUPERPIXELS

Sylvain Boltz (1,2), Frank Nielsen (1), Stefano Soatto (2)

(1) École Polytechnique, France
(2) UCLA Vision Lab
{boltz,nielsen}@lix.polytechnique.fr, [email protected]

ABSTRACT

Earth Mover Distance (EMD) is a popular distance between Probability Density Functions (PDFs). It has been successfully applied to a wide range of image processing problems. This success has two reasons: a physical one, since it computes a physical cost of transporting an element of mass between two images or two histograms, and a statistical one, since it is a cross-bin metric (as opposed to a bin-wise metric). In computer vision, these features are useful since small variations of illuminance can shift the histogram. However, histograms are not a sufficient statistic to discriminate images since they ignore all geometric correlations. In addition, the transport (also called flow) between histograms loses the geometric flow needed to warp one image onto another. This paper proposes a new construction of the EMD between images. This construction approximates the EMD between two images by computing a pixel-wise transport at the complexity cost of computing an EMD between 1-D histograms, and it preserves the geometrical and topological structure of the image. The construction simply relies on a segmentation of the image (also called superpixelization of the image). Results on image matching show the stability of the method even when the superpixelizations are highly inconsistent across images.

Index Terms: Earth Mover Distance, Wasserstein metric, Matching, Superpixel, Sparsity, Segmentation

1. INTRODUCTION

Image matching is a problem at the heart of many image processing and computer vision problems. Indeed, it is difficult to build an efficient matching cost between two images that is robust to changes of illumination and viewpoint. Several matching strategies exist in the literature. Some methods extract key-points, like the popular SIFT, in the two images and match them. Such strategies are highly sensitive to noisy key-point detections and to the outliers of the matching model. Other methods try to build an appearance template of the image, but these methods are either too strict, for instance the ℓ2 distance, or not descriptive enough, for instance image histogram comparisons. In this paper, we explore an approach in between: it builds on the segmentation of the images at multiple scales, also called a superpixelization tree. The superpixels act as local descriptors, and each one carries part of the physical mass of the image, encoded in its size. The difficulty is then to match those segmentations across images of a video sequence.

This problem looks simple since the image is now reduced to a subset of elements, but it is a hard combinatorial problem to solve. Moreover, since segmentations are highly inconsistent from one image to another, there is ambiguity: superpixels can merge or split from one image to the other. The matching between the two segmentations is thus not one-to-one but a continuous flow, and using the EMD to compute this flow seems natural. The earth mover distance has already been used on image histograms [1, 2] or directly on the image pixels [3]. The first method is not discriminative enough since the geometric information of the image is lost. The second one is too complex since it solves a Partial Differential Equation (PDE) with an unconstrained flow on every pixel. This paper proposes an in-between approach. It approximates the pixel-wise flow between two images at the cost of comparing histograms. This is done by building a subset of pixels with weightings, obtained from a segmentation tree. This subset of pixels looks like a histogram, with geometrical and topological information contained in the affinity distance matrix. Since 256 is a usual number of bins in a 1-D histogram as well as a typical number of superpixels in an image, the complexities are similar.

The paper is organized as follows. In Section 2, we present the earth mover distance. In Section 3, we show how the earth mover distance can be defined on segmentation trees and how to introduce topological consistency. In Section 4, we show experiments on consecutive images of video sequences. Finally, we give conclusions and perspectives in Section 5.

2. EARTH MOVER DISTANCE METRIC

The Earth Mover Distance is the discrete formulation of the famous optimal transport problem, also called the Wasserstein metric or the Monge-Kantorovich problem. It is a distance between probability density functions or, on discrete data, histograms.


Two histograms P and Q are given, as well as a distance affinity matrix D(i, j). This matrix gives the cost of transporting one element of mass (i.e. one pixel) from the i-th bin of P to the j-th bin of Q. The EMD computes a flow matrix F, where F(i, j) is the amount of mass in the i-th bin of histogram P transported to the j-th bin of histogram Q. The goal of optimal transport is then to find the F that minimizes the total cost of all transports, each weighted by D(i, j), needed to warp histogram P into histogram Q.

$$\mathrm{EMD}(P, Q) = \min_{F} \sum_{i,j} F(i,j)\, D(i,j) \tag{1}$$
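Eq. (1) can be solved as a small linear program. The sketch below is only an illustration, not the solver used in this paper: it assumes the standard transport constraints, which the text does not spell out (F is non-negative, row i of F sums to P(i), column j sums to Q(j)), and it assumes both histograms carry the same total mass. The function name `emd_lp` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def emd_lp(p, q, D):
    """Solve Eq. (1) as a linear program (assumes equal total mass)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    D = np.asarray(D, dtype=float)
    n, m = D.shape
    if not np.isclose(p.sum(), q.sum()):
        raise ValueError("this sketch assumes equal total mass")

    # F is flattened row-major, so F(i, j) sits at index i * m + j.
    # Row-sum constraints: sum_j F(i, j) = p[i].
    A_rows = np.kron(np.eye(n), np.ones((1, m)))
    # Column-sum constraints: sum_i F(i, j) = q[j].
    A_cols = np.kron(np.ones((1, n)), np.eye(m))

    res = linprog(
        D.ravel(),
        A_eq=np.vstack([A_rows, A_cols]),
        b_eq=np.concatenate([p, q]),
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(n, m)


# Two bins, unit distance between them: two units of mass must move.
cost, F = emd_lp([3, 1], [1, 3], [[0, 1], [1, 0]])
```

A generic LP solver is far slower than a dedicated flow algorithm, so this form is mainly useful for checking small cases by hand.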

The EMD gives two interesting outputs. The first one is the distance value, which gives a matching score between histograms. It can be read physically as the amount of mass displaced, weighted by the cost of each displacement. Compared to other famous scores between histograms used in the statistics community, such as the Kullback-Leibler divergence or the Hellinger distance, it is one of the few cross-bin distances. This means that it does not assume the bin values are correctly aligned, as bin-wise comparisons do. This is a particularly desirable feature in computer vision, since changes of illumination or viewpoint can shift the values of the histogram. The second output is the flow F itself. However, as opposed to other distances between histograms, the complexity of the EMD is higher, since it has to solve a combinatorial problem of matching N bins to N other bins. In addition, it is designed to work efficiently on histograms, which are not a very discriminative feature of the image. Some works have tried to solve the optimal transport directly on the image pixels, but this results in a complex PDE and brings new problems, since there are no regularity constraints on the flow F. The contribution of this paper is a way of computing the transport on image pixels with the complexity cost of matching 1-D histograms and without losing the geometry and the topological structure.
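The cross-bin property can be seen on a toy case. The sketch below compares a bin-wise L1 distance with the 1-D EMD on single-spike histograms; the 32-bin layout and the spike positions are arbitrary choices made for illustration, and `scipy.stats.wasserstein_distance` is used as a stand-in 1-D EMD on normalized histograms.

```python
import numpy as np
from scipy.stats import wasserstein_distance

bins = np.arange(32)


def spike(k):
    """Histogram with all of its (unit) mass in bin k."""
    h = np.zeros(bins.size)
    h[k] = 1.0
    return h


ref, near, far = spike(5), spike(6), spike(15)

# Bin-wise L1 distance: identical for a small and a large shift.
l1_near = np.abs(ref - near).sum()
l1_far = np.abs(ref - far).sum()

# Cross-bin EMD: grows with the size of the shift.
emd_near = wasserstein_distance(bins, bins, ref, near)
emd_far = wasserstein_distance(bins, bins, ref, far)
```

Here the L1 distance cannot tell a one-bin shift from a ten-bin shift, while the EMD scales with the shift, which is the behavior wanted under small illumination changes.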

3. SUPERPIXEL-BASED HISTOGRAMS

3.1. Definition

Based on the idea of coresets [4], we look for a subset of pixels with different weightings that represents the image while keeping two transports comparable. The two transports are the EMD of the weighted subset of pixels and the EMD of the original problem, i.e. the transport of all pixels individually. In our approach, pixels are grouped together into small regions of different sizes called superpixels. Several algorithms exist to build a superpixelization of the image. Among them is the Quickshift algorithm [5], a variant of the famous mean-shift algorithm. Another is Statistical Region Merging (SRM), a region merging technique [6]. The goal is now to compute the optimal transport of these superpixels from one image to another. We formulate this problem as a histogram matching problem, without losing the geometric structure. One can define a histogram of an image with as many bins as there are superpixels, and then define the mass inside each bin of this histogram as the superpixel size.

$$P(i) = |S_1(i)| \tag{2}$$

where $S_1(i)$ is the i-th superpixel of image 1 and $|S_1(i)|$ is its size in pixels. The histogram Q of image 2 is defined in the same way. The cost D(i, j) of moving one element of mass (i.e. one pixel) from the i-th bin of one histogram to the j-th bin of the other histogram is the average cost of moving a pixel from one superpixel to the other, i.e. the cost of moving the mean pixel of one superpixel to the mean pixel of the other. Since the EMD transports pixels in the geometric and radiometric space, the cost of moving one pixel is computed in a 5-D space: 3-D for the colors and 2-D for the geometric position.

$$D(i,j) = \left\| \bar{S}_1(i) - \bar{S}_2(j) \right\| \tag{3}$$

where $\bar{S}_1(i)$ is the 5-D mean inside superpixel $S_1(i)$. By defining such a cost, we directly follow the coreset idea of reducing the number of points while still trying to approximate the optimal transport of the original problem (transport between individual pixels). In addition, we gain an implicit regularization, since the transport flow of all the pixels inside one superpixel is constrained to be equal.
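The construction of Eq. (2) and Eq. (3) can be sketched from a label map. This is a minimal illustration under stated assumptions: `labels` is an integer map with contiguous labels 0..n-1 (every label used at least once), the image has three color channels, and color and position are concatenated without any relative weighting, since the text does not specify one. The function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist


def superpixel_signature(image, labels):
    """Return (mass, means): Eq. (2) sizes and 5-D means per superpixel."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # One 5-D feature per pixel: 3 color channels, then x and y position.
    feats = np.concatenate(
        [image.reshape(-1, 3).astype(float),
         xs.reshape(-1, 1),
         ys.reshape(-1, 1)],
        axis=1,
    )
    flat = labels.ravel()
    n = int(flat.max()) + 1
    mass = np.bincount(flat, minlength=n).astype(float)  # |S(i)| in pixels
    sums = np.zeros((n, 5))
    np.add.at(sums, flat, feats)  # accumulate features per superpixel
    return mass, sums / mass[:, None]


def cost_matrix(means1, means2):
    """Eq. (3): Euclidean distance between 5-D superpixel means."""
    return cdist(means1, means2)
```

With `mass1, means1` and `mass2, means2` computed for two frames of the same size, the total masses agree (both equal the pixel count), so the pair `(mass1, mass2)` together with `cost_matrix(means1, means2)` is a balanced transport problem.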

Finally, if one is not interested in approximating the EMD of the original problem, one could use any distance between superpixels as the transport cost D(i, j). For instance, one could estimate a unimodal 5-D Gaussian inside each superpixel. The transport cost between two superpixels would then be the transport cost between unimodal Gaussians, which is known in closed form [7]. In this case, if $\mathcal{N}(\mu_i, \Sigma_i)$ is the Gaussian approximation of superpixel $S_1(i)$ and $\mathcal{N}(\mu_j, \Sigma_j)$ is the Gaussian approximation of superpixel $S_2(j)$ in the 5-D space, then the transport cost between those two superpixels is defined as:

$$D(i,j)^2 = \left\| \mu_i - \mu_j \right\|^2 + \operatorname{tr}(\Sigma_i) + \operatorname{tr}(\Sigma_j) - 2 \operatorname{tr}\!\left( \left( \Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2} \right)^{1/2} \right) \tag{4}$$
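Eq. (4) translates almost term by term into code. The sketch below is an illustration only; it assumes the covariances are symmetric positive semi-definite and uses `scipy.linalg.sqrtm` for the matrix square roots, discarding the small imaginary parts that numerical round-off can introduce. The function name `gaussian_transport_cost` is hypothetical.

```python
import numpy as np
from scipy.linalg import sqrtm


def gaussian_transport_cost(mu_i, cov_i, mu_j, cov_j):
    """Return D(i, j) of Eq. (4) for two Gaussians."""
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    cov_i, cov_j = np.asarray(cov_i, float), np.asarray(cov_j, float)

    root_i = sqrtm(cov_i)                     # Sigma_i^{1/2}
    cross = sqrtm(root_i @ cov_j @ root_i)    # (Sigma_i^{1/2} Sigma_j Sigma_i^{1/2})^{1/2}
    d2 = (
        np.sum((mu_i - mu_j) ** 2)
        + np.trace(cov_i)
        + np.trace(cov_j)
        - 2.0 * np.trace(cross)
    )
    # Clip tiny negative values caused by round-off before the square root.
    return float(np.sqrt(max(float(np.real(d2)), 0.0)))
```

As a sanity check, for isotropic covariances a²I and b²I in dimension d with equal means, Eq. (4) reduces to D² = d(a − b)².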

3.2. Including topological constraints

Building histograms on superpixels enforces some geometric structure in the histograms. However, this constraint can be strengthened. In particular, when a segmentation tree is available, one would want to keep the topological structure of the tree in the matching. For instance, imagine that a good segmentation tree is provided, meaning that superpixels belonging to the same object are grouped together at some scale. Before matching superpixels at small scales, which is a risky procedure, one could force the ancestors at larger scales to match by defining the following cost matrix D(i, j).

$$D(i,j) = \sum_{s} \left\| \bar{S}_{1,s}(i) - \bar{S}_{2,s}(j) \right\| \tag{5}$$

where $S_{1,s}(i)$ is the parent superpixel of $S_1(i)$ at scale s, and the bar denotes the 5-D mean inside a superpixel. The advantage of plugging the topological constraint into matrix D is that it does not increase the complexity of the matching. Alternatively, one could design a more accurate EMD by solving the EMD at different scales and propagating the flow F from one scale to another.

Fig. 1. Topological constraints for robust matching. From left to right, top to bottom: first image, second image, superpixelization of the first image at two different scales, superpixelization of the second image at two different scales. Topological constraints in the EMD add the cost of matching the superpixelizations at a coarse scale to the cost of matching them at a fine scale.

Fig. 1 illustrates the topological constraints. To match two superpixels at a fine scale, one sums up the costs of matching their parent superpixels at the coarser scales. In this way the topology of the matching is enforced.
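The topological cost of Eq. (5) can be sketched as a sum of per-scale distance matrices. The data layout below is an assumption made for illustration: for each scale s, `means[s]` holds the 5-D means of the superpixels at that scale, and `ancestors[s]` gives, for every finest-scale superpixel, the index of its ancestor at scale s. The finest scale is assumed to be included as scale 0 with identity ancestors, which matches the description of adding the coarse cost to the fine cost. The function name is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist


def topological_cost(means1, ancestors1, means2, ancestors2):
    """Eq. (5): sum over scales of distances between ancestor means."""
    n1, n2 = len(ancestors1[0]), len(ancestors2[0])
    D = np.zeros((n1, n2))
    for m1, a1, m2, a2 in zip(means1, ancestors1, means2, ancestors2):
        # m1[a1] repeats each ancestor mean once per fine superpixel,
        # so every scale contributes a matrix of the same (n1, n2) shape.
        D += cdist(np.asarray(m1, float)[a1], np.asarray(m2, float)[a2])
    return D
```

Because the result has the same shape as the single-scale matrix of Eq. (3), it can be passed unchanged to whatever EMD solver is used, which is why the constraint does not increase the size of the matching problem.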

3.3. Solving the EMD

Once the superpixel histograms P and Q are built with Eq. (2), and their pairwise distances D(i, j) are defined with Eq. (3), Eq. (4) or Eq. (5), one needs to find the EMD of Eq. (1) by minimizing over the flow F(i, j). For this, we use the code available from [8]. It provides both the EMD value and the flow F. It runs in less than a second for the usual number of superpixels we deal with in this paper. This algorithm uses a thresholding of the matrix D to speed up a max-flow algorithm. This thresholding can be easily interpreted in our framework, since there is no need to compute D(i, j) between superpixels that are far apart or that have different ancestors. In this setting, topological constraints speed up the matching, since they threshold more distances in matrix D.

Fig. 2. Matching superpixelizations. From left to right, top to bottom: first image, second image, superpixelization of the first image (false colors), superpixelization of the second image (false colors). Even between two images with small differences, the superpixelizations, here in false colors, can be quite inconsistent. The matching of these two superpixel maps is given by the color code: superpixel i in image 2 has the color of the superpixel in image 1 with label j = arg max_j F(j, i).
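The thresholding of D mentioned in Section 3.3 can be illustrated by saturating the cost matrix. This is only a sketch of the idea of a thresholded ground distance, min(D, t); it does not reproduce the solver of [8], and the threshold value `t` and the function name are hypothetical choices.

```python
import numpy as np


def saturate_cost(D, t):
    """Cap every cost at t and report which pairs were capped."""
    D = np.asarray(D, dtype=float)
    saturated = D >= t          # pairs treated as "too far to matter"
    return np.minimum(D, t), saturated


D = np.array([[0.0, 5.0], [2.0, 9.0]])
D_t, mask = saturate_cost(D, t=4.0)
```

All capped pairs share the same cost t, so a solver that exploits this structure only has to treat the uncapped pairs individually. A larger cost between superpixels with mismatched ancestors, as in Eq. (5), pushes more pairs above the threshold.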

4. EXPERIMENTAL RESULTS

In Fig. 2, we took two consecutive frames of a sequence from an optical flow benchmark. On both images, we perform a single-scale superpixelization. As one can see, even between images with small deformations, the two superpixel maps are not consistent. The EMD solution is given by the color code. The color of each superpixel in image 1 is chosen randomly. A superpixel i in image 2 has the color of the superpixel j = arg max_j F(j, i) in image 1. This visualization is a partial representation of the flow, since the flow is continuous and we show only the best one-to-one match.
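The color coding described above can be sketched as follows. The layout is an assumption: `F` has one row per superpixel of image 1 and one column per superpixel of image 2, so that F[j, i] is the mass moved from superpixel j to superpixel i, and `labels2` is the integer label map of image 2. The function name and the fixed random seed are hypothetical.

```python
import numpy as np


def transfer_colors(F, labels2, seed=0):
    """Color image 2 with the colors of its best matches in image 1."""
    rng = np.random.default_rng(seed)
    n1 = F.shape[0]
    # Random false color for every superpixel of image 1.
    palette1 = rng.integers(0, 256, size=(n1, 3), dtype=np.uint8)
    # For each superpixel i of image 2: j = argmax_j F(j, i).
    best = F.argmax(axis=0)
    # Look up the inherited color of every pixel of image 2.
    colored2 = palette1[best][labels2]
    return palette1, colored2, best
```

Only the largest entry of each column of F is shown this way, so splits and merges carried by the remaining non-zero entries are not visible in the picture.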

We perform a similar experiment on another video sequence in Fig. 3. This video sequence, “Football”, is a difficult one for image matching since the players have colors similar to those of the spectators. There is also motion blur due to fast motion.

Finally, we show some quantitative experiments in Table 1. We manually labeled 250 superpixels from two different images, providing a one-to-one manual matching between the superpixels. We then evaluate how well several algorithms match the superpixels. The Nearest Neighbor (NN) method is the simplest matching: an NN search of the superpixel in the 5-D space, i.e. the minimum of Eq. (3). EMD is the algorithm built on superpixels with Eq. (3), and EMD-T is the EMD built on superpixels with the topological constraints of Eq. (5). Since the earth mover distance does not give a one-to-one flow, the first row selects the maximum of the flow for each superpixel (as in the previous color coding), while the second row counts a match as long as the flow between the two superpixels is different from zero.

Fig. 3. Matching superpixelizations. From left to right, top to bottom: first image, second image, superpixelization of the first image (false colors), superpixelization of the second image (false colors). Even between two images with small differences, the superpixelizations, here in false colors, can be quite inconsistent. The matching of these two superpixel maps is given by the color code: superpixel i in image 2 has the color of the superpixel in image 1 with label j = arg max_j F(j, i).

                             NN    EMD   EMD-T
Correct best match (in %)    71     91      94
Non zero flow (in %)          -     93      97

Table 1. Comparison of Nearest Neighbor (NN), our method (EMD) and our method with topological consistency (EMD-T), using a manual matching of superpixels as reference.
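The two scores reported in Table 1 can be sketched as follows. This is an illustration of the scoring rules as described in the text, not the evaluation code behind the table. It assumes `F` and `D` have one row per superpixel of image 1 and one column per superpixel of image 2, and that `truth[i]` is the manually assigned image-1 match of superpixel i in image 2. The tolerance `tol` for deciding that a flow is non-zero and the function names are hypothetical.

```python
import numpy as np


def nn_matches(D):
    """Nearest Neighbor baseline: minimum of Eq. (3) for each column."""
    return np.asarray(D).argmin(axis=0)


def match_scores(F, truth, tol=1e-9):
    """Return (best-match %, non-zero-flow %) against a manual matching."""
    F = np.asarray(F, dtype=float)
    truth = np.asarray(truth)
    cols = np.arange(F.shape[1])
    # First row of Table 1: the largest flow must point to the true match.
    best_ok = 100.0 * np.mean(F.argmax(axis=0) == truth)
    # Second row: any non-zero flow to the true match counts.
    nonzero_ok = 100.0 * np.mean(F[truth, cols] > tol)
    return best_ok, nonzero_ok
```

The second score is never lower than the first one for a non-negative flow whose columns are not all zero, which is consistent with the ordering of the two rows in the table.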

5. CONCLUSION

In this paper, we have proposed a new way of matching images with the EMD. It is expressed as an EMD built on superpixels. The geometric and topological structures of the superpixels are taken into account to build the affinity distance matrix. This approach does not assign a one-to-one match but computes a transport flow, so the superpixels can split and merge. This representation has a physical justification, since it computes the mass transport between different images. Future work will use this representation to track several regions in video sequences and to incorporate the stability of the segmentations, as recently studied in [9]. Finally, being able to track superpixels leads to many applications, from video segmentation [10] to action recognition.

6. REFERENCES

[1] T. Chan, S. Esedoglu, and K. Ni, “Histogram based segmentation using Wasserstein distances,” in International Conference on Scale Space Methods and Variational Methods in Computer Vision, 2007, vol. 4485, p. 697.

[2] Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s distance as a metric for image retrieval,” International Journal of Computer Vision, vol. 40, no. 2, pp. 99–121, 2000.

[3] S. Haker, L. Zhu, A. Tannenbaum, and S. Angenent, “Optimal mass transport for registration and warping,” International Journal of Computer Vision, vol. 60, no. 3, pp. 225–240, 2004.

[4] M. R. Ackermann and J. Blömer, “Coresets and approximate clustering for Bregman divergences,” in Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009, pp. 1088–1097.

[5] A. Vedaldi and S. Soatto, “Quick shift and kernel methods for mode seeking,” in European Conference on Computer Vision, 2008, vol. IV, pp. 705–718.

[6] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1452–1458, 2004.

[7] H. Greenspan, G. Dvir, and Y. Rubner, “Region correspondence for image matching via EMD flow,” in Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries, 2000, p. 27.

[8] O. Pele and M. Werman, “Fast and robust earth mover’s distances,” in IEEE International Conference on Computer Vision, 2009.

[9] F. Chazal, L. J. Guibas, S. Y. Oudot, and P. Skraba, “Persistence-based clustering in Riemannian manifolds,” Research Report 6968, INRIA, June 2009.

[10] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in IEEE International Conference on Computer Vision, 2009.