
The Hidden Convexity of Spectral Clustering

Mikhail Belkin Luis Rademacher James Voss

December 8, 2013

Abstract

In recent years spectral clustering has become a standard method for data analysis used in a broad range of applications. In this paper we propose a new class of algorithms for multiway spectral clustering based on optimization of a certain class of functions over a sphere. These algorithms can be interpreted geometrically as recovering a discrete weighted simplex. The proposed methods have some resemblance to Independent Component Analysis and involve optimization of a certain "contrast function" over a sphere. However, in our case theoretical guarantees can be provided for a much broader class of contrast functions under a "hidden convexity" condition. The proposed algorithms are very simple to implement, efficient, and, unlike most of the existing algorithms for multiclass spectral clustering, not initialization-dependent.

Partitioning a dataset into classes based on a similarity between data points, known as cluster analysis, is one of the most basic and practically important problems in data analysis and machine learning. It has a vast array of applications, from speech recognition to image analysis, bioinformatics, and data compression. There is an extensive literature on the subject, including a number of different methodologies as well as their various practical and theoretical aspects [6].

In recent years spectral clustering, which refers to various methods based on the eigenvectors of a certain matrix constructed from data (typically the graph Laplacian), has become a widely used method for cluster analysis. This is due to the simplicity of the algorithm, a number of desirable properties it exhibits, and its amenability to theoretical analysis. In its simplest form, spectral bi-partitioning is an attractively straightforward algorithm based on thresholding the second bottom eigenvector of the Laplacian matrix of a graph. However, the more practically significant problem of multi-way spectral clustering is considerably more complex. While hierarchical methods based on a sequence of binary splits have been used, the most common approaches use k-means or weighted k-means clustering in the spectral space, or related iterative procedures [11, 9, 2, 14]. The typical algorithm for multi-way spectral clustering can be summarized as a two-step process:

1. Spectral embedding: A similarity graph for the data is constructed based on the feature representation of the data. If one is looking for $k$ clusters, one takes the embedding given by the bottom $k$ eigenvectors of the (normalized or un-normalized) graph Laplacian corresponding to that graph.

2. Clustering: In the second step, the embedded data (sometimes rescaled) is clustered, typically using the conventional k-means algorithm or a variation of it; a minimal sketch of this two-step pipeline is given below.
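For concreteness, here is a minimal NumPy/SciPy sketch of this standard two-step baseline (embedding followed by k-means), not of the method proposed later in this paper; the function name and the use of `scipy.cluster.vq.kmeans2` are illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_clustering_baseline(A, k):
    """Two-step baseline: embed with the bottom k eigenvectors of the unnormalized
    graph Laplacian, then cluster the embedded rows with k-means."""
    d = A.sum(axis=1)
    L = np.diag(d) - A                          # unnormalized graph Laplacian L = D - A
    _, X = eigh(L, subset_by_index=[0, k - 1])  # step 1: bottom k eigenvectors as columns
    _, labels = kmeans2(X, k, minit='++')       # step 2: k-means on the spectral rows
    return labels
```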


The first step, the spectral embedding given by the eigenvectors of Laplacian matrices, has a number of interpretations. It can be explained via spectral graph theory as a relaxation of multiway cut problems [12]. In the extreme case of a similarity graph having $k$ connected components, the embedded vectors reside in $\mathbb{R}^k$, and vectors corresponding to the same connected component are mapped to a single point. There are also connections to other areas of machine learning and mathematics, in particular to the geometry of the underlying space from which the data is sampled [3].

While the existing approaches to clustering the resulting embedded data (the second step) have some theoretical justification ([2, 14]), the existing theoretical analyses typically assume that the minimum of the energy function can be found. However, the actual algorithms resemble k-means clustering, and the output quality is initialization dependent, as these algorithms can become trapped in local minima.

In our paper we propose a new class of algorithms for the second step of multi-way spectral clustering. The starting point is that in the ideal case, where the $k$ clusters are perfectly separated, the spectral embedding using the bottom $k$ eigenvectors has a particularly simple geometric form. For the un-normalized (or asymmetrically normalized) Laplacian it is simply a discrete weighted $(k-1)$-dimensional simplex, and recovering its vertices is sufficient for cluster identification. This observation was first made in [13] (see also [7] for some applications), where the authors also proposed an optimization procedure to recover the simplex vertices. For the normalized Laplacian the structure is slightly more complex but, as it turns out, still suitable for our analysis.

The approach taken in our paper relies on an optimization problem resembling that of Independent Component Analysis (see [4] for a broad overview). Specifically, we show that the problem of identifying $k$ clusters reduces to maximizing a certain "contrast function" over a $(k-1)$-sphere. We derive a simple mathematical characterization describing a large space of admissible contrast functions. It turns out that these functions have a certain "hidden convexity" property (see Section 3) and, moreover, that this property is necessary and sufficient for guaranteed recovery. Specifically, each local maximum of such a function on the sphere corresponds to exactly one cluster in the data.

Interestingly, most functions used in the usual ICA setting (with the exception of thecumulants) do not have the analogous theoretical guarantees.

We provide algorithms for maximizing these functions, including running time guarantees as well as some experimental results.

Finally, we note connections to some recent work on applying the method of moments. In [5], one of the results shows recovery of a simplex using moments of order three and can be thought of as a special case of our approach. Another recent work [1] also uses the moment method to recover a continuous simplex given samples from the uniform probability distribution.

1 Problem Statement

Let $G = (V, A)$ denote a similarity graph, where $V$ is a set of $n$ vertices and $A$ is a non-negative weighted adjacency matrix. Two vertices $i, j \in V$ are adjacent when $a_{ij} > 0$, and the value of $a_{ij}$ is interpreted as a measure of the similarity between the two vertices. In spectral clustering, the goal is to partition the vertices of the graph into sets $S_1, \dots, S_m$ such that these sets form natural clusters in the graph.


In the most basic setting, $G$ consists of $m$ connected components, and the natural clusters should be the components themselves. In this case, $a_{i'j'} = 0$ whenever $i' \in S_i$, $j' \in S_j$ with $i \neq j$. For convenience, we may number the vertices of $V$ so that all indices in $S_i$ precede all indices in $S_j$ when $i < j$. Then $A$ takes the block diagonal form

$$ A = \begin{pmatrix} A_{S_1} & 0 & \cdots & 0 \\ 0 & A_{S_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{S_m} \end{pmatrix}. \qquad (1) $$

In this setting, spectral clustering can be viewed as a technique for reorganizing a given similarity matrix $A$ into such a block diagonal matrix. We will refer to this setting, in which $G$ consists of $m$ connected components, as the idealized spectral clustering problem.

In practice, $G$ rarely consists of $m$ truly distinct connected components. Instead, one typically observes a perturbed matrix $\tilde A = A + E$, where $E$ is some error matrix with (hopefully small) entries $e_{ij}$. Thus, for $i$ and $j$ in different clusters, all that can be said is that $\tilde a_{ij}$ should be small. The goal of spectral clustering is to permute the rows and columns of $\tilde A$ to form a matrix which is nearly block diagonal and to recover the corresponding clusters. Whereas a simple depth first search would suffice in the idealized setting, in general we require an algorithm which remains meaningful under perturbation. We present a novel algorithm for spectral clustering and demonstrate its correctness in the idealized setting. The algorithm is also tested on real data, demonstrating that it has practical value.

1.1 A note on notation

Before proceeding, we fix some notation used throughout the paper. $[k]$ denotes the set $\{1, 2, \dots, k\}$ for a positive integer $k$. For a matrix $B$, $b_{ij}$ denotes its $ij$th entry, $b_{i\cdot}$ its $i$th row vector, and $b_{\cdot j}$ its $j$th column vector. For a vector $v$, $\|v\|$ denotes the standard Euclidean 2-norm. Given two vectors $u$ and $v$, $\langle u, v\rangle$ denotes their inner product, i.e. the standard dot product. $\mathbf{1}_S$ is the indicator vector of the set $S$, i.e. the vector which is 1 on indices in $S$ and 0 otherwise. For a matrix $M$, $N(M)$ denotes the null space of $M$. $S^{m-1}$ denotes the unit sphere in $\mathbb{R}^m$. $C^{(r)}(\mathcal{X})$ denotes the set of $r$-times continuously differentiable functions defined on the domain $\mathcal{X}$. $\angle(u, v)$ is the angle in $[0, \pi]$ between the vectors $u$ and $v$; angles are given in radians. Finally, for a subspace $X$ of $\mathbb{R}^m$, $P_X$ denotes the square orthogonal projection matrix from $\mathbb{R}^m$ onto $X$.

2 Simplex Structure of the Graph Laplacian’s Null Space

Given a similarity graph $G = (V, A)$, define the diagonal degree matrix $D$ with entries $d_{ii} = \sum_{j\in V} a_{ij}$. The unnormalized graph Laplacian is defined as $L = D - A$. The following well known property of the graph Laplacian (see [12]) helps shed light on its importance: given $u \in \mathbb{R}^n$,

$$ u^T L u = \frac{1}{2}\sum_{i,j\in V} a_{ij}(u_i - u_j)^2. \qquad (2) $$
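As a quick sanity check of identity (2), the following NumPy snippet compares both sides on a small random symmetric similarity matrix; the random matrix is only an illustrative stand-in for a real similarity graph.

```python
import numpy as np

# Numerical check of identity (2): u^T L u = (1/2) * sum_{i,j} a_ij (u_i - u_j)^2.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)

D = np.diag(A.sum(axis=1))
L = D - A                       # unnormalized graph Laplacian
u = rng.standard_normal(6)

lhs = u @ L @ u
rhs = 0.5 * np.sum(A * (u[:, None] - u[None, :]) ** 2)
assert np.isclose(lhs, rhs)     # the two sides agree up to floating point error
```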


Matrix $L$ is positive semi-definite, as (2) cannot be negative. A vector $u$ is a 0-eigenvector of $L$ (equivalently, $u \in N(L)$) precisely when $u^T L u = 0$. When $G$ consists of $m$ connected components with indices in the sets $S_1, \dots, S_m$, inspection of (2) shows that $u \in N(L)$ precisely when $u$ is piecewise constant on each part $S_i$. In particular,

$$ \{|S_1|^{-1/2}\mathbf{1}_{S_1}, \dots, |S_m|^{-1/2}\mathbf{1}_{S_m}\} $$

is an orthonormal basis for $N(L)$. There are many possible choices of an orthonormal basis of $N(L)$, and we cannot assume that any particular basis will provide indicators of the various classes. However, as observed in [13], any orthogonal basis of $N(L)$ generates a simplex structure which spatially separates the connected components of $G$. More formally:

Lemma 1. Let the similarity graph $G$ contain $m$ connected components, and let $L$ be the unnormalized graph Laplacian of $G$. Then $N(L)$ has dimension $m$. Let $X = (x_{\cdot 1}, \dots, x_{\cdot m})$ contain in its columns $m$ scaled, orthogonal vectors forming a basis of $N(L)$, each with $\|x_{\cdot i}\| = \sqrt n$. Then there exist $m$ mutually orthogonal vectors $Z_1, \dots, Z_m \in \mathbb{R}^m$ such that whenever $i \in S_j$, the row vector $x_{i\cdot} = Z_j^T$. Further, $\|Z_j\|^2 = n|S_j|^{-1}$.

Proof. Define $M_{S_i} := \mathbf{1}_{S_i}\mathbf{1}_{S_i}^T$. The projection $P_{N(L)}$ can be constructed from any orthonormal basis of $N(L)$. In particular, using the two bases $\{|S_1|^{-1/2}\mathbf{1}_{S_1}, \dots, |S_m|^{-1/2}\mathbf{1}_{S_m}\}$ and $\{\tfrac{1}{\sqrt n}x_{\cdot 1}, \dots, \tfrac{1}{\sqrt n}x_{\cdot m}\}$ yields

$$ P_{N(L)} = \sum_{i=1}^m |S_i|^{-1} M_{S_i} \qquad\text{and}\qquad P_{N(L)} = \frac{1}{n}XX^T. $$

Thus for $i, j \in V$, $\frac{1}{n}\langle x_{i\cdot}, x_{j\cdot}\rangle = (P_{N(L)})_{ij}$. In particular, if there exists $k \in [m]$ such that $i, j \in S_k$, then $\frac{1}{n}\langle x_{i\cdot}, x_{j\cdot}\rangle = |S_k|^{-1}$; when $i$ and $j$ belong to separate clusters, $x_{i\cdot} \perp x_{j\cdot}$. For $i, j$ in component $k$,

$$ \cos(\angle(x_{i\cdot}, x_{j\cdot})) = \frac{\langle x_{i\cdot}, x_{j\cdot}\rangle}{\|x_{i\cdot}\|_2\,\|x_{j\cdot}\|_2} = \frac{n|S_k|^{-1}}{\sqrt{n|S_k|^{-1}}\,\sqrt{n|S_k|^{-1}}} = 1, $$

which gives that $x_{i\cdot}$ and $x_{j\cdot}$ point in the same direction. As they also have the same magnitude, $x_{i\cdot}$ and $x_{j\cdot}$ coincide for any two indices $i$ and $j$ belonging to the same component of $G$.

Hence, there are $m$ distinct vectors $Z_1, \dots, Z_m$ corresponding to the $m$ connected components of $G$ such that $x_{i\cdot} = Z_k^T$ for each $i \in S_k$. Further, $Z_i \perp Z_j$ when $i \neq j$, and $\|Z_i\|^2 = n|S_i|^{-1}$ for each $i \in [m]$.
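The lemma is easy to check numerically. The sketch below builds a block-diagonal similarity matrix with three connected components (the block sizes and random weights are arbitrary choices for illustration), forms $X$ as in Lemma 1, and prints the Gram matrix of one representative row per component; up to floating point error, the off-diagonal entries are 0 and the diagonal entries are $n/|S_i|$.

```python
import numpy as np
from scipy.linalg import eigh, block_diag

rng = np.random.default_rng(0)

def random_block(s):
    """A random connected similarity block (all off-diagonal entries positive)."""
    B = rng.random((s, s))
    B = (B + B.T) / 2
    np.fill_diagonal(B, 0.0)
    return B

sizes = [4, 6, 10]                                   # |S_1|, |S_2|, |S_3|
A = block_diag(*[random_block(s) for s in sizes])
n, m = A.shape[0], len(sizes)

L = np.diag(A.sum(axis=1)) - A
_, V = eigh(L, subset_by_index=[0, m - 1])           # an orthonormal basis of N(L)
X = np.sqrt(n) * V                                   # columns rescaled to norm sqrt(n), as in Lemma 1

reps = np.array([X[sum(sizes[:i])] for i in range(m)])   # one representative row per component
print(np.round(reps @ reps.T, 6))                         # diagonal ~ n/|S_i|, off-diagonal ~ 0
```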

2.1 Interpreting the Simplex Structure of the Laplacian Embedding

Lemma 1 demonstrates that, using the null space of the unnormalized graph Laplacian, the $m$ connected components of $G$ are mapped to the $m$ vertices of a simplex in $\mathbb{R}^m$. Of course, under perturbation of the similarity matrix $A$, the interpretation of Lemma 1 must change. In particular, $G$ will no longer consist of $m$ connected components, and instead of using only vectors in $N(L)$, $X$ must be constructed using the eigenvectors corresponding to the lowest $m$ eigenvalues of $L$. With the perturbation of $A$ comes a corresponding perturbation of these eigenvectors. When the perturbation is not too large, the resulting rows of $X$ yield $m$ nearly orthogonal clouds of tightly clustered points.

Because the clusters admit different interpretations under perturbation, the following normalized versions of the graph Laplacian are also used for spectral clustering:

$$ L_{\mathrm{sym}} := D^{-1/2} L D^{-1/2}, \qquad L_{\mathrm{rw}} := D^{-1} L. $$

Using either $L_{\mathrm{sym}}$ or $L_{\mathrm{rw}}$ in place of $L$ when generating the eigenvector matrix $X$ from Lemma 1 is fully consistent with the algorithms proposed in this paper. In the idealized setting, $N(L_{\mathrm{rw}})$ coincides with $N(L)$, making Lemma 1 and all subsequent results in this paper equally applicable to $L_{\mathrm{rw}}$. If $L_{\mathrm{sym}}$ is used, the simplex structure of Lemma 1 is replaced by a slightly more involved ray structure. The admissibility of $L_{\mathrm{sym}}$ will be discussed in a future version of this paper.

In the perturbed setting, it is natural to define a notion of a best clustering within the similarity graph. One way of doing this is to use graph cuts. For two sets of vertices $S_1$ and $S_2$, the cut is defined as $\mathrm{Cut}(S_1, S_2) := \sum_{i\in S_1,\, j\in S_2} a_{ij}$. In the $m$-way min cut problem, the goal is to partition the vertices into $m$ non-empty sets which minimize the cost

$$ C_{\mathrm{Cut}}(\{S_1, \dots, S_m\}) = \sum_{i=1}^m \mathrm{Cut}(S_i, S_i^c). \qquad (3) $$

Such a partition gives an "optimal" set of $m$ clusters. However, this objective does not penalize small clusters, making it plausible that an "optimal" cut simply detaches $m-1$ nearly isolated vertices or small clusters from the rest of the graph.

In order to favor clusters of more equal size, one can consider variants of the cut problem. In particular, spectral clustering arises as a relaxation of two different NP-hard problems, namely $m$-way min normalized cut and $m$-way min ratio cut. In the ratio cut problem, one minimizes the cost

$$ C_{\mathrm{RCut}}(\{S_1, \dots, S_m\}) = \sum_{i=1}^m \frac{\mathrm{Cut}(S_i, S_i^c)}{|S_i|}. $$

Alternatively, in the normalized $m$-way cut problem, one minimizes the cost

$$ C_{\mathrm{NCut}}(\{S_1, \dots, S_m\}) = \sum_{i=1}^m \frac{\mathrm{Cut}(S_i, S_i^c)}{\sum_{j\in S_i} d_{jj}}. $$

Spectral clustering using $L$ arises as a relaxation of the $m$-way ratio cut problem (see [12], Section 5), whereas the use of $L_{\mathrm{sym}}$ arises as a relaxation of the $m$-way normalized cut problem (see [14]). These interpretations do not give error bounds under perturbation, but they do give some insight into how clusters are formed. For instance, using $L_{\mathrm{sym}}$ discourages classifying a few sparsely connected points as a cluster, since the normalization term $\sum_{j\in S_i} d_{jj}$ for such a cluster in $C_{\mathrm{NCut}}$ is also small.
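Both costs are straightforward to evaluate for a given partition. A small NumPy sketch follows; the function names are our own, and partitions are passed as lists of index arrays.

```python
import numpy as np

def cut(A, S, T):
    """Cut(S, T): total similarity between index sets S and T."""
    return A[np.ix_(S, T)].sum()

def ratio_cut_cost(A, parts):
    """C_RCut: sum over parts of Cut(S_i, complement of S_i) / |S_i|."""
    n = A.shape[0]
    return sum(cut(A, S, np.setdiff1d(np.arange(n), S)) / len(S) for S in parts)

def normalized_cut_cost(A, parts):
    """C_NCut: sum over parts of Cut(S_i, complement of S_i) / vol(S_i)."""
    n, d = A.shape[0], A.sum(axis=1)
    return sum(cut(A, S, np.setdiff1d(np.arange(n), S)) / d[S].sum() for S in parts)
```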

Alternatively, $L_{\mathrm{rw}}$ arises from the theory of random walks. The matrix $D^{-1}A$ is row-stochastic (each row sums to one), allowing the entry $(D^{-1}A)_{ij}$ to be interpreted as the transition probability from state $i$ to state $j$ in a Markov chain. In this setting, the notion of $m$ clusters is recast as a random walk with $m$ invariant aggregates. In the idealized spectral clustering problem, there are $m$ sets $S_1, \dots, S_m$ of states such that the probability of transitioning between states belonging to distinct sets is 0. Creating $X$ as in Lemma 1 using $L_{\mathrm{rw}}$ in place of $L$ maps each state within an aggregate to a single simplex point.

3 Simplex Optimization Landscape and the Hidden Convexity

We continue our analysis in the idealized setting. Given a graph $G$ with $m$ connected components, let $X$; $S_1, \dots, S_m$; and $Z_1, \dots, Z_m$ be constructed from $L$ as in Lemma 1. Then the simplex vertices $Z_1, \dots, Z_m$ are mutually orthogonal in $\mathbb{R}^m$. Let $w_i := n^{-1}|S_i|$. Thus a $w_i$ fraction of the row vectors of $X$ coincide with $Z_i^T$, and $\|Z_i\| = w_i^{-1/2}$. We denote the unit vector $Z_i/\|Z_i\|$ by $\bar Z_i$. It suffices to recover the directions $\bar Z_i$ up to sign in order to cluster the points. That is, a spectral point $x_{j\cdot}$ with $j \in S_i$, corresponding to the $j$th vertex of $G$, lies on the line through $\pm\bar Z_i$ and the origin, making these lines correspond to the vertex clusters.

We consider an approach related to FastICA which optimizes "arbitrary" symmetric functions over directional projections of the data. Let $f : S^{m-1} \to \mathbb{R}$ be defined on the unit sphere in terms of a "contrast" function $g : \mathbb{R} \to \mathbb{R}$ as follows:

$$ f(u) := \frac{1}{n}\sum_{i=1}^n g(\langle u, x_{i\cdot}\rangle). \qquad (4) $$

This can equivalently be written

$$ f(u) = \sum_{i=1}^m w_i\, g(u^T Z_i). \qquad (5) $$
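Computationally, (4) is a one-line reduction over the spectral rows. A minimal sketch, using $g(y) = y^4$ as an example of an admissible contrast (the $p = 4$ case of the family $|y|^p$ discussed in Section 4.1); the function name is ours:

```python
import numpy as np

def contrast_objective(u, X, g):
    """f(u) = (1/n) * sum_i g(<u, x_i.>), as in (4); X holds the spectral rows."""
    return np.mean(g(X @ u))

g4 = lambda y: y ** 4   # admissible contrast: g(sqrt(x)) = x^2 is strictly convex on [0, inf)
```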

It will suffice that g meets the following assumptions:

A1. g is continuous on R.

A2. g is symmetric on R, i.e. g(y) = g(−y) for all y.

A3. Function g(√x) is strictly convex on [0,∞).

The function $f$ constructed via (4) from any function $g$ satisfying A1–A3 can be used to recover the spectral simplex vertices up to sign:

Theorem 2. Let $f : S^{m-1} \to \mathbb{R}$ be defined via (4) in terms of a contrast function $g : \mathbb{R} \to \mathbb{R}$ satisfying assumptions A1–A3. Then the set $\{\pm\bar Z_i : i \in [m]\}$ is a complete enumeration of the local maxima of $f$. Moreover, these local maxima are strict.

We now prove Theorem 2. The proof proceeds by establishing sufficient and necessary conditions in a series of lemmas exploiting the convexity structure induced by the substitution $u_i \mapsto u_i^2$.


We will first need the following well known optimality condition for convex maximization (see [10, Chapter 32] for background). Let $K \subseteq \mathbb{R}^m$ be a convex set. A point $x \in K$ is called an extreme point of $K$ iff $x$ is not a strict convex combination of two other points of $K$. For a polytope, the set of extreme points is the set of vertices.

Lemma 3. Let $K \subseteq \mathbb{R}^n$ be a convex set and let $f : K \to \mathbb{R}$ be a strictly convex function. Then the set of local maxima of $f$ relative to $K$ is contained in the set of extreme points of $K$.

Proof. We prove the contrapositive. Suppose $x \in K$ is not an extreme point. Then $x = \lambda u + (1-\lambda)v$ for some $\lambda \in (0, 1)$ and $u, v \in K$. Without loss of generality we can translate everything so that $x = 0$. By strict convexity of $f$ we have $f(0) < \max\{f(u), f(v)\}$. The same argument shows $f(0) < \max\{f(\varepsilon u), f(\varepsilon v)\}$ for any $\varepsilon \in (0, 1]$. Letting $\varepsilon \to 0$ shows that $x = 0$ is not a local maximum relative to $K$.

Lemma 4. Let $h_i : [0,\infty) \to \mathbb{R}$, $i \in [m]$, be a family of strictly convex functions, and let $H : [0,\infty)^m \to \mathbb{R}$ be given by $H(x) = \sum_i h_i(x_i)$. Let $w_i > 0$ for $i \in [m]$. Then the set of local maxima of $H$ relative to the scaled simplex $\mathrm{conv}\{w_i e_i\}_{i=1}^m$ is contained in the set $\{w_i e_i\}_{i=1}^m$.

Proof. $H$ is strictly convex on its domain, and $\{w_i e_i\}_{i=1}^m$ is the set of extreme points of $\mathrm{conv}\{w_i e_i\}_{i=1}^m$. The result follows immediately from Lemma 3.

Lemma 5 (Necessary optimality condition). Let $g_i : [0,\infty) \to \mathbb{R}$, $i = 1, \dots, m$, be such that $t \mapsto g_i(\sqrt t)$ is strictly convex on $[0,\infty)$. Let $f : \mathbb{R}^m \to \mathbb{R}$ be given by $f(x) = \sum_i g_i(|x_i|)$. Then the set of local maxima of $f$ relative to $S^{m-1}$ is contained in $\{\pm e_i : i \in [m]\}$.

Proof. Suppose $z \in S^{m-1}$ is not in $\{\pm e_i : i \in [m]\}$. Without loss of generality we assume $z \geq 0$. Let $h_i : [0,\infty) \to \mathbb{R}$, $i \in [m]$, be given by $h_i(t) = g_i(\sqrt t)$, and let $H : [0,\infty)^m \to \mathbb{R}$ be given by $H(y) = \sum_i h_i(y_i)$, as in Lemma 4. By definition $H$ is a sum of convex functions and hence convex; moreover, it is easy to see that it is strictly convex on its domain. The map $G : [0,\infty)^m \to [0,\infty)^m$ given by $G(x) = (x_i^2)_i$ is a homeomorphism between $S^{m-1} \cap [0,\infty)^m$ and $\Delta^{m-1} := \mathrm{conv}\{e_i : i \in [m]\}$, and $f(x) = H(G(x))$ on $S^{m-1} \cap [0,\infty)^m$. This implies that the properties of being a local maximum of $H$ relative to $\Delta^{m-1}$ and being a local maximum of $f$ relative to $S^{m-1} \cap [0,\infty)^m$ are preserved under $G$, and the local maxima are in one-to-one correspondence. As $G(z)$ is not a canonical basis vector, Lemma 4 (applied with $w_i = 1$) implies that $G(z)$ is not a local maximum of $H$ relative to $\Delta^{m-1}$. Thus $z$ is not a local maximum of $f$ relative to $S^{m-1} \cap [0,\infty)^m$, and in particular not relative to $S^{m-1}$.

Lemma 6. Let $h : [0,\infty) \to \mathbb{R}$ be a strictly convex function and let $w_i > 0$ for $i \in [m]$. Let $H : [0,\infty)^m \to \mathbb{R}$ be given by $H(x) = \sum_i w_i\, h(x_i/w_i)$. Then the set $\{e_i\}_{i=1}^m$ is contained in the set of strict local maxima of $H$ relative to $\Delta^{m-1}$.

Proof. By symmetry, it is enough to show that $e_1$ is a strict local maximum. Let $h_i(t) = w_i h(t/w_i)$, so that $H(x) = \sum_i h_i(x_i)$. We need to understand the behavior of $h_1$ around 1 and of $h_2, \dots, h_m$ around 0. To this end, and to take advantage of strict convexity, we consider a two-piece piecewise affine interpolating upper bound to each $h_i$. We pick $t_i \in (0, 1)$ so that the approximation is affine on $[0, t_i]$ and $[t_i, 1]$, with $t_i$ chosen so that the slope of the left piece is the same for all $i$. The slope of the right piece is always larger than the slope of the left piece by strict convexity. We now make these choices precise. Let $w_{\max} = \max\{w_i : i \in [m]\}$ and $t_i = \frac{w_i}{2m\max\{1, w_{\max}\}}$, so that $0 < t_i \leq \frac{1}{2m}$. The piecewise affine upper bound to $h_i$ is the interpolant through the points at $0$, $t_i$, and $1$. The left piece has slope

$$ \frac{h_i(t_i) - h_i(0)}{t_i} = \frac{w_i\, h\!\left(\frac{1}{2m\max\{1, w_{\max}\}}\right) - w_i\, h(0)}{\frac{w_i}{2m\max\{1, w_{\max}\}}} = \frac{h\!\left(\frac{1}{2m\max\{1, w_{\max}\}}\right) - h(0)}{\frac{1}{2m\max\{1, w_{\max}\}}}, $$

which is independent of $i$; we denote it $m_l$. The right piece of the interpolant of $h_i$ has a slope that we denote $m_i$. By strict convexity we have $m_i > m_l$. The fact that the interpolating pieces are upper bounds implies the following inequalities:

$$ h_i(x) < m_l x + w_i h(0) \ \text{ for } x \in (0, t_i), \qquad h_i(x) < h_i(1) - m_i(1 - x) \ \text{ for } x \in (t_i, 1). \qquad (6) $$

Let $y \in \Delta^{m-1}$, $y \neq e_1$, be such that $y_i \leq t_i$ for $i = 2, \dots, m$. This implies $y_1 \geq 1/2$. Moreover, the set of such $y$ is a neighborhood of $e_1$ relative to $\Delta^{m-1}$. Putting everything together,

$$ \begin{aligned}
H(e_1) - H(y) &= h_1(1) + \sum_{i=2}^m w_i h(0) - \sum_{i=1}^m h_i(y_i) \\
&\geq h_1(1) + \sum_{i=2}^m w_i h(0) - \Big[ h_1(1) - m_1(1 - y_1) + \sum_{i=2}^m \big(m_l y_i + w_i h(0)\big) \Big] && \text{(using (6))} \\
&= m_1(1 - y_1) - m_l \sum_{i=2}^m y_i \\
&= (m_1 - m_l)(1 - y_1) \\
&> 0.
\end{aligned} $$

Lemma 7 (Sufficient optimality condition). Let $g : [0,\infty) \to \mathbb{R}$ be such that $t \mapsto g(\sqrt t)$ is strictly convex on $[0,\infty)$, and let $w_i > 0$ for $i \in [m]$. Let $f : \mathbb{R}^m \to \mathbb{R}$ be given by $f(x) = \sum_i w_i\, g(|x_i|/\sqrt{w_i})$. Then the set $\{\pm e_i : i \in [m]\}$ is contained in the set of strict local maxima of $f$ relative to $S^{m-1}$.

Proof. By symmetry, it is enough to show that $e_1$ is a strict local maximum of $f$ relative to $S^{m-1} \cap [0,\infty)^m$. The homeomorphism of Lemma 5 implies that it is enough to show that $e_1$ is a strict local maximum of $H(x) = \sum_i w_i\, h(x_i/w_i)$, where $h(t) = g(\sqrt t)$, relative to $\Delta^{m-1}$. This follows immediately from Lemma 6.

The proof of the main result (Theorem 2) now follows quite easily:

Proof of Theorem 2. By (5), after an orthogonal change of coordinates taking $\bar Z_i$ to $e_i$, $f$ takes the form $f(u) = \sum_i w_i\, g(|u_i|/\sqrt{w_i})$ of Lemmas 5 and 7 (using that $g$ is symmetric and $u^T Z_i = u_i/\sqrt{w_i}$ in the new coordinates). That every element of $B = \{\pm\bar Z_i : i \in [m]\}$ is a strict local maximum of $f$ then follows from Lemma 7. That $f$ has no other local maxima besides those in $B$ follows from Lemma 5.


3.1 Necessity of Assumption A3

We show next that when $g_i = g$ in Lemma 5, the strict convexity of $t \mapsto g(\sqrt t)$ is essentially necessary to rule out additional local maxima for some choice of weights. For clarity, the next argument uses derivatives, giving an insubstantial gap between the necessary and the sufficient conditions.

Proposition 8. Let $g : [0,\infty) \to \mathbb{R}$ be a continuous function such that $g$ is twice continuously differentiable on $(a, b)$ for some $0 < a < b$, and such that $\frac{d^2}{dt^2} g(\sqrt t) < 0$ for some $t \in (a, b)$. Then there exist weights $v, w > 0$ such that $f(x, y) = v\, g(|x|/\sqrt v) + w\, g(|y|/\sqrt w)$ has a strict local maximum relative to $S^1$ that is not in $\{\pm e_1, \pm e_2\}$. More precisely, if $v = w = 1/(2t)$, then $(x, y) = (1/\sqrt 2, 1/\sqrt 2)$ is a strict local maximum of $f$ relative to $S^1$.

Proof. We use the homeomorphism from the proof of Lemma 5. Let $h(x) = g(\sqrt x)$. It is enough to show that $1/2$ is a strict local maximum of

$$ H(x) = v\, h(x/v) + w\, h((1 - x)/w) $$

on $[0, 1]$. With our choice of weights $v$ and $w$, we have $H'(x) = h'(x/v) - h'((1-x)/w)$ and $H'(1/2) = 0$. Similarly, $H''(x) = h''(x/v)/v + h''((1-x)/w)/w$, so $H''(1/2) = h''(t)/v + h''(t)/w < 0$. This completes the proof.

4 Spectral Clustering Algorithms

4.1 Choosing a Contrast Function

We first consider the function $g_2(y) = y^2$, which fails to satisfy Assumption A3. Taking $f_2(u) = \frac{1}{n}\sum_{i=1}^n g_2(\langle u, x_{i\cdot}\rangle)$, a quick computation demonstrates that $f_2$ is constant on the unit sphere:

$$ f_2(u) = \sum_{i\in[m]} w_i (u^T Z_i)^2 = \sum_{i\in[m]} w_i\, w_i^{-1} (u^T \bar Z_i)^2 = \|u\|_2^2 = 1. $$

Defining $f$ using $g_2$ clearly fails because it cannot distinguish between directions on the unit sphere and hence cannot find the point clusters. By Assumption A3, admissible contrast functions grow more quickly than $g_2$ in the sense that $y \mapsto g'(y)/y$ should be an increasing function. As a result, admissible contrasts are more strongly affected by clusters containing points which are far away from the origin. Since the distance of cluster $i$ from the origin is $1/\sqrt{w_i} = \sqrt{n/|S_i|}$, which is larger for smaller clusters, small clusters will have higher maxima. When choosing a contrast, there is therefore a tradeoff between choosing a function which grows only slightly more quickly than $g_2$, and hence has maxima of similar strength for each cluster, and choosing a function which grows far more quickly than $g_2$, and hence satisfies Assumption A3 more robustly.


With this in mind, there are many functions $g$ which satisfy A1–A3, including the following:

$$ g_{\mathrm{ht}}(y) = (\log\cosh(y))^2, \qquad g_p(y) = |y|^p \ \text{ where } p \in (2, \infty), \qquad g_{\mathrm{abs}}(y) = -|y|, \qquad g_{\mathrm{gau}}(y) = e^{-y^2}. $$

In view of the ICA literature, we expect $g_{\mathrm{abs}}$ and $g_{\mathrm{ht}}$ to be good choices for $g$. $g_{\mathrm{ht}}$ is an outer approximation to $y \mapsto (\tfrac{1}{2}y)^2$ in the sense that $g_{\mathrm{ht}}$ approaches $y \mapsto (\tfrac{1}{2}y)^2$ asymptotically for large values of $y$. Due to this similarity, $g_{\mathrm{ht}}$ will not be heavily biased against larger clusters. For both $g_{\mathrm{abs}}$ and $g_{\mathrm{ht}}$, $g'(y)/y$ approaches a constant as $y \to \infty$, making them grow similarly to $g_2$ in the limit. However, the structure of $g_{\mathrm{abs}}$ is still very distinct from $g_2$, making it a good, easy-to-compute, general-purpose contrast.
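For use in a gradient-based search such as FindOpt1 below, these contrasts and their derivatives can be written as vectorized functions. The dictionary layout and names are illustrative choices, and the derivative of $g_{\mathrm{abs}}$ is taken away from $y = 0$.

```python
import numpy as np

# The contrasts listed above together with their derivatives g'(y).
contrasts = {
    "ht":  (lambda y: np.log(np.cosh(y)) ** 2,
            lambda y: 2.0 * np.log(np.cosh(y)) * np.tanh(y)),
    "p=3": (lambda y: np.abs(y) ** 3,
            lambda y: 3.0 * np.abs(y) * y),
    "abs": (lambda y: -np.abs(y),
            lambda y: -np.sign(y)),
    "gau": (lambda y: np.exp(-y ** 2),
            lambda y: -2.0 * y * np.exp(-y ** 2)),
}
```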

4.2 Algorithms

We now have all the tools needed to create a new class of algorithms for spectral clustering. Given a similarity graph $G = (V, A)$, define a graph Laplacian $\mathcal{L}$ as either $L$ or $L_{\mathrm{rw}}$ (reader's choice). Then, viewing $G$ as a perturbation of a graph consisting of $m$ connected components, construct $X \in \mathbb{R}^{n\times m}$ such that $x_{\cdot i}$ is the eigenvector corresponding to the $i$th smallest eigenvalue of $\mathcal{L}$. $X$ can be constructed using an existing eigensolver.

With $X$ in hand, choose a contrast function $g$ which satisfies A1–A3. From $g$, the function $f(u) = \frac{1}{n}\sum_{i=1}^n g(\langle u, x_{i\cdot}\rangle)$ of (4) is defined on $S^{m-1}$ using the rows of $X$. From the discussion in Section 3, the local maxima of $f$ roughly correspond to the desired spectral clusters. Since $f$ is a symmetric function, if $f$ has a local maximum at $u$ then $f$ also has a local maximum at $-u$. However, the directions $u$ and $-u$ correspond to the same line in $\mathbb{R}^m$ and form an equivalence class, with each such equivalence class corresponding to a cluster.

Our first goal is to find local maxima of $f$ corresponding to distinct equivalence classes. Once we have obtained such local maxima $u_1, \dots, u_m$ of $f$, we cluster the vertices of $G$ by placing vertex $i$ in the $j$th cluster using the rule $j = \arg\max_k |\langle u_k, x_{i\cdot}\rangle|$, which completes the clustering algorithm. When searching for the local maxima of $f$, we may also use the fact that local maxima of $f$ belonging to different equivalence classes should lie in approximately orthogonal directions. We sketch two algorithmic ideas below.


Algorithm 1: Finds the local maxima of f defined using the points x_i· given by the rows of X. The second input η is the learning rate (step size).

1:  function FindOpt1(X, η)
2:    C ← {}
3:    for i ← 1 to m do
4:      Draw u from S^{m−1} ∩ span(C)⊥ uniformly at random.
5:      repeat
6:        u ← u + η(∇f(u) − u u^T ∇f(u))   (= u + η P_{u⊥} ∇f(u))
7:        u ← P_{span(C)⊥} u
8:        u ← u / ‖u‖
9:      until convergence
10:     C ← C ∪ {u}
11:   end for
12:   return C
13: end function

FindOpt1 is a form of projected gradient ascent. The parameter $\eta$ gives the learning rate. Each iteration of the repeat–until loop moves $u$ in the direction of steepest ascent. For gradient ascent in $\mathbb{R}^m$, one would expect step 6 to read $u \leftarrow u + \eta\nabla f(u)$. However, gradient ascent is being performed for a function $f$ defined on the unit sphere, while the gradient $\nabla f$ is that of $f$ viewed as a function on $\mathbb{R}^m$. $\nabla f(u) - u u^T\nabla f(u)$ is the projection of $\nabla f(u)$ onto the tangent plane of $S^{m-1}$ at $u$; this update keeps $u$ near the sphere.

Vector $u$ can be drawn from $S^{m-1} \cap \mathrm{span}(C)^\perp$, for instance, by first drawing $u$ from $S^{m-1}$, projecting onto $\mathrm{span}(C)^\perp$, and then normalizing $u$. It is important that $u$ stay near the orthogonal complement of $\mathrm{span}(C)$ in order to converge to a new cluster rather than to a previously found optimum of $f$. Step 7 enforces this constraint during the update step.
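A minimal NumPy sketch of FindOpt1 follows; a fixed iteration count stands in for the convergence test, and the function names, default step size, and iteration budget are our own choices.

```python
import numpy as np

def find_opt1(X, g_prime, eta=0.1, iters=200):
    """Projected gradient ascent sketch of Algorithm 1 (FindOpt1).
    X holds the spectral rows x_i; g_prime is the derivative of the contrast g;
    eta is the learning rate."""
    n, m = X.shape
    rng = np.random.default_rng()
    C = np.zeros((0, m))                           # found cluster-center directions, one per row
    for _ in range(m):
        u = rng.standard_normal(m)                 # draw u, then push it into span(C)^perp
        u -= C.T @ (C @ u)
        u /= np.linalg.norm(u)
        for _ in range(iters):
            grad = X.T @ g_prime(X @ u) / n        # gradient of f(u) = (1/n) sum_i g(<u, x_i>)
            u = u + eta * (grad - u * (u @ grad))  # step along the tangent plane of the sphere
            u -= C.T @ (C @ u)                     # stay orthogonal to previously found directions
            u /= np.linalg.norm(u)
        C = np.vstack([C, u])
    return C

def assign_clusters(X, C):
    """Place vertex i in the cluster k maximizing |<u_k, x_i.>|."""
    return np.argmax(np.abs(X @ C.T), axis=1)
```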

The second algorithmic idea more directly uses the point separation implied by the orthogonal simplex structure of the approximate cluster centers. In particular, since every data point is close to some cluster center, the data points themselves can be used as test points:


Algorithm 2: Finds the local maxima of f defined using the points x_i· given by the rows of X. The second input δ controls how far a point must be from previously found cluster centers to be a candidate future cluster center.

1: function FindOpt2(X, δ)
2:   C ← {}
3:   while |C| < m do
4:     j ← arg max_i { f(x_i·/‖x_i·‖) : 〈x_i·/‖x_i·‖, u〉 < 1 − δ for all u ∈ C }
5:     C ← C ∪ { x_j·/‖x_j·‖ }
6:   end while
7:   return C
8: end function

By pre-computing the values of $f(x_{i\cdot}/\|x_{i\cdot}\|)$ outside of the while loop, FindOpt2 can be run in $O(mn^2)$ time. For large similarity graphs, FindOpt2 is likely to be slower than FindOpt1, which takes $O(m^2 n t)$ time, where $t$ is the average number of iterations to convergence. The number of clusters $m$ cannot exceed (and is usually much smaller than) the number of vertices $n$.

FindOpt2 has a couple of nice features which may make it preferable, especially on smaller data sets. Each center found by FindOpt2 always lies within a cluster of data points, even when the optimization landscape is distorted under perturbation. The maxima found by FindOpt2 also reflect a more global view of what counts as a local maximum, which may be important in the noisy setting. Finally, FindOpt2 is a fully deterministic algorithm. However, the parameter $\delta > 0$ needs to be chosen sufficiently large to encompass most of the perturbation with respect to any cluster.
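A minimal NumPy sketch of FindOpt2 (the function name is ours); the contrast values $f$ at all candidate points are computed once up front, as suggested above.

```python
import numpy as np

def find_opt2(X, m, g, delta=0.1):
    """Greedy center selection among the normalized data points, as in Algorithm 2."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)   # candidate directions x_i / ||x_i||
    scores = np.mean(g(U @ X.T), axis=1)               # f at every candidate, precomputed
    C = []
    while len(C) < m:
        ok = np.ones(len(U), dtype=bool)
        for c in C:                                    # discard candidates close to found centers
            ok &= U @ c < 1.0 - delta
        j = np.argmax(np.where(ok, scores, -np.inf))
        C.append(U[j])
    return np.array(C)
```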

5 Simulation Data and Experiments

5.1 A Toy Example

Figure 1 contains four panels which display the process of spectral clustering on a toy example consisting of random points $p_1, p_2, \dots, p_{1250}$ generated using 3 concentric circles: 200 points were drawn uniformly at random from a circle of radius 1, 350 points from a circle of radius 3, and 700 points from a circle of radius 5. The points were then given a scalar, multiplicative perturbation.

From this data, a similarity matrix $A$ was created according to the rule $a_{ij} = \exp(-\tfrac{1}{4}\|p_i - p_j\|_2^2)$. Self-similarity is set to 0 by convention. The point indices were chosen during the generation of the data to make each cluster contiguous, making $A$ block diagonal up to the effects of inter-cluster similarities. Looking at the graphical depiction of the similarity matrix in Figure 1(c), one can see the diagonal block structure encoded by the similarity matrix $A$.
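A sketch of this data and similarity construction follows; the RNG seed and the size of the multiplicative perturbation are assumptions, since the exact noise level is not specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

def circle(n_pts, radius, noise=0.05):
    """Points on a circle with a scalar multiplicative (radial) perturbation."""
    theta = rng.uniform(0.0, 2.0 * np.pi, n_pts)
    r = radius * (1.0 + noise * rng.standard_normal(n_pts))
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

P = np.vstack([circle(200, 1.0), circle(350, 3.0), circle(700, 5.0)])
sq = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
A = np.exp(-0.25 * sq)                                       # a_ij = exp(-||p_i - p_j||^2 / 4)
np.fill_diagonal(A, 0.0)                                     # self-similarity set to 0 by convention
```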


(Figure 1 panel graphics omitted; panels: (a) Clustered Data, (b) Rows of the Eigenspace, (c) Similarity Matrix A, (d) coordinate-wise log of similarity matrix.)

Figure 1: An illustration of spectral clustering on a toy example consisting of data randomly generated on 3 noisy circles. (a) shows the input data with colors corresponding to the labeling produced by spectral clustering. (b) uses the three eigenvectors of $L_{\mathrm{rw}}$ corresponding to its three smallest eigenvalues to produce spectral data points (the rows of the resulting eigenvector matrix). These points are displayed with their final labelings given by their color, matching the colors from (a). The color map on the sphere corresponds to the value of the contrast function $g_{\mathrm{ht}}(x) = (\log\cosh x)^2$, with 3 rays protruding from the sphere corresponding to the local maxima found using FindOpt1. The signs of the ray directions are chosen to pass near the data clusters. (c) and (d) give scaled entries of the similarity matrix $A = [a_{ij}]$ and the log similarity matrix $[\log(a_{ij})]$, respectively. In both cases, black corresponds to high similarity values and white corresponds to low similarity values.

Despite each point having high similarity with some points in its own class, most points have low similarity even with other points from the same class. Figure 1(d), which shows the log of the scaled similarity values as intensities, shows where the error from the idealized case is present. There is a higher floor on the inter-class similarities than on the intra-class similarities of the green class (the radius-5 circle). This issue is most apparent in the green class and explains why the green class is the most perturbed from being a single point in the simplex structure in Figure 1(b). The row-vector data points were generated from the eigenvectors of the smallest 3 eigenvalues of $L_{\mathrm{rw}}$. The data form a reasonable simplex structure, and the contrast function used ($g_{\mathrm{ht}}(x) = (\log\cosh x)^2$) has 3 clean local maxima corresponding to the simplex directions. The class labels given by the point colors were recovered using FindOpt1.

It is worth noting that, since FindOpt1 strictly enforces orthogonality of the obtained cluster center directions, the last direction chosen has no degrees of freedom left to find a local maximum, and thus corresponds to a true maximum only if the simplex structure holds exactly. This is most likely why the ray corresponding to the red class in Figure 1(b) does not pass exactly through either the local maximum of the contrast or the cluster of points. It may be tempting to correct for this by removing step 7 from FindOpt1 after achieving initial convergence and continuing until an unconstrained convergence is achieved; doing so, however, runs the risk in practice of approaching a previously found cluster.

5.2 Spectral Clustering on Pictures

In addition to testing the proposed spectral clustering techniques against simulated data, the techniques were applied to the task of image segmentation on the BSDS300 image set (see [8]). The goal in image segmentation is to divide an image into regions which represent distinct objects or features of the image. A bottom-up approach might first run edge detectors across the image and then connect the edges to outline regions. Spectral clustering allows for an approach using both the bottom-up and top-down perspectives simultaneously: nearby pixels are labeled as being similar based on some local similarity measure, and then, using a global view of the similarity, the pixels are clustered.

While there are many ways of defining a similarity measure between nearby points, we used a relatively simple similarity measure based only on the color and location information of the pixels. Let $p_i$ denote the $i$th pixel, with location $x_i$ and RGB color $c_i = (r_i, g_i, b_i)^T$. We use the following similarity between any two distinct pixels $p_i$ and $p_j$:

$$ a_{ij} = \begin{cases} e^{-\frac{1}{\alpha^2}\|x_i - x_j\|_2^2}\, e^{-\frac{1}{\beta^2}\|c_i - c_j\|_2^2} & \text{if } \|x_i - x_j\| < R, \\ 0 & \text{if } \|x_i - x_j\| \geq R, \end{cases} $$

for some parameters $\alpha$, $\beta$, and radius $R$. Enforcing that $a_{ij}$ is 0 for pairs of pixels which are not close together allows one to build a very sparse similarity matrix, greatly speeding up computations. As the similarity decays exponentially with distance, the zeroed entries would have been close to zero anyway.
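A dense sketch of this pixel similarity (in practice one would build it as a sparse matrix, which is the point of the truncation radius $R$); the function name is ours, and the parameter names follow the text.

```python
import numpy as np

def pixel_similarity(xy, rgb, alpha, beta, R):
    """Product of Gaussian kernels in position and color, truncated beyond spatial radius R.
    xy has pixel locations (n x 2); rgb has pixel colors (n x 3)."""
    d_pos = np.sum((xy[:, None, :] - xy[None, :, :]) ** 2, axis=-1)
    d_col = np.sum((rgb[:, None, :] - rgb[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d_pos / alpha ** 2) * np.exp(-d_col / beta ** 2)
    A[np.sqrt(d_pos) >= R] = 0.0          # truncation makes the matrix sparse in practice
    np.fill_diagonal(A, 0.0)
    return A
```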

Determining the number of clusters to use in spectral clustering is an unsolved problem. However, the BSDS300 data set includes hand-labeled segmentations. For each image, one human segmentation was chosen at random, and its number of segments was used as the number of clusters $m$ for spectral clustering; no other information from the human segmentations was used. Figure 2 shows a couple of the segmentations generated by applying FindOpt1 to the BSDS300 data set. In generating these segmentations, $9\times 9$ median filtering was applied before constructing the similarity matrix in order to reduce salt-and-pepper type noise.

Figure 2: Several images from the BSDS300 test set segmented using $L_{\mathrm{rw}}$ and the contrast $g_{\mathrm{abs}}(x) = -|x|$. Before constructing the similarity matrix, images underwent $9\times 9$ median filtering, which is intended to reduce salt-and-pepper type noise. Red pixels mark the borders between segmented regions. Other than these red pixels, the displayed images are unaltered from the input images.

References

[1] J. Anderson, N. Goyal, and L. Rademacher. Efficient learning of simplices. In COLT, pages 1020–1045, 2013.

[2] F. R. Bach and M. I. Jordan. Learning spectral clustering, with application to speech separation. Journal of Machine Learning Research, 7:1963–2001, 2006.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[4] P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Elsevier, 2010.

[5] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.

[6] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ, USA, 1988.

[7] P. Kumar, N. Narasimhan, and B. Ravindran. Spectral clustering as mapping to a simplex. 2013 ICML Workshop on Spectral Learning, 2013.

[8] D. R. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, pages 416–425, 2001.

[9] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.

[10] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ, 1997. Reprint of the 1970 original.

[11] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[12] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[13] M. Weber, W. Rungsarityotin, and A. Schliep. Perron cluster analysis and its connection to graph partitioning for noisy data. Konrad-Zuse-Zentrum für Informationstechnik Berlin, 2004.

[14] S. X. Yu and J. Shi. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 313–319. IEEE, 2003.
