
MATH-SHU 236
Spectral Clustering, Graph Laplacian

Shuyang Ling

March 11, 2020

1 Limitation of k-means

We apply k-means to three different examples and see how it works.

(a) Gaussian mixture model: the general form of the Gaussian mixture model has its pdf
\[
f(x; \{\mu_i, \Sigma_i, w_i\}) = \sum_{i=1}^{k} w_i \mathcal{N}(\mu_i, \Sigma_i), \qquad \sum_{i=1}^{k} w_i = 1.
\]

GMM is a popular data generative model, often used for comparing the performance of different clustering methods. We can see k-means works quite well for data generated from GMM; see Figure 1.


Figure 1: Gaussian mixture model

(b) Data on two concentric circles. The obvious difference from GMM is that the data have an underlying geometrical structure. Moreover, the two clusters are not linearly separable, i.e., you cannot find a line to separate one circle from the other. Ideally, we hope to recover both circles, but k-means does not perform as expected: it outputs clusters that simply cut both circles into halves with a line. In fact, this is not the fault of k-means, since the resulting clustering gives a smaller k-means objective function value. In other words, the clustering shown in Figure 2 is preferred by k-means.




Figure 2: Data on two concentric circles

(c) Data on two “long bars”. This dataset is highly anisotropic compared with the first example: each cluster is not isotropic (i.e., the variance is not approximately the same in every direction). Instead, each cluster spreads more along the y-axis than the x-axis. In this case, even though the clusters are linearly separable, k-means still does not give the desired result, as shown in Figure 3.


Figure 3: Anisotropic data

What are the pros and cons of k-means?

1. Pros: easy to implement; works well for datasets whose clusters are well separated and near-isotropic.

2. Cons: cannot handle data with geometric structure or anisotropic datasets.
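The failure mode in case (b) is easy to reproduce. Below is a minimal numpy sketch of Lloyd's algorithm run on two concentric circles; the radii (1 and 3), the number of points, and the plain random initialization are illustrative assumptions, not the exact setup behind Figure 2.

```python
import numpy as np

# Two concentric circles (radii 1 and 3 are assumed for illustration).
n = 200
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]
X = np.vstack([ring, 3 * ring])
true_rings = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]

def kmeans(X, k, iters=100, seed=1):
    """Plain Lloyd's algorithm: assign each point to its nearest center,
    then recompute each center as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        centers = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
    return labels

labels = kmeans(X, 2)
# With two centers the decision boundary is a straight line, so each
# cluster mixes points from both rings.
purity = max(np.mean(labels == true_rings), np.mean(labels != true_rings))
print(f"agreement with the true rings: {purity:.2f}")
```

The agreement stays well below 1: a linear cut through concentric circles cannot recover the rings, which is exactly the limitation Figure 2 illustrates.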



2 Spectral clustering

Spectral clustering is a graph-based method which uses the eigenvectors of the graph Laplacian derived from the given data to partition the data. It outperforms k-means since it can capture “the geometry of data” and the local structure. Let's first give the algorithm and then explain what each step means. We refer the interested reader to an excellent review on this topic [3].

2.1 Algorithm of spectral clustering

The spectral clustering algorithm takes the following steps:

1. Step 1: Construct an undirected weighted graph, represented by the weight matrix W = (w_{ij})_{1 ≤ i,j ≤ n}, from the data \{x_i\}_{i=1}^n, i.e., we view each x_i as a vertex of the graph and w_{ij} as the edge weight between x_i and x_j. The rule of thumb is: a larger w_{ij} means more similarity/association between x_i and x_j. Here are two commonly used examples:

(a) The σ-neighborhood graph:
\[
w_{ij} = \begin{cases} 1, & \|x_i - x_j\| \le \sigma, \\ 0, & \text{otherwise.} \end{cases}
\]
If x_i and x_j are close to each other, then w_{ij} = 1; otherwise, w_{ij} = 0.

(b) The fully connected graph:
\[
w_{ij} = \exp\Big( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \Big)
\]
where σ is the width of the neighborhoods. If x_i is close to x_j, then w_{ij} is close to 1. On the other hand, if x_i is far away from x_j (what “far away” means depends strongly on the choice of σ), then w_{ij} is close to 0.
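The two constructions can be compared directly on a toy dataset. In this numpy sketch the three one-dimensional points and σ = 1 are arbitrary illustrative choices, not values from the notes:

```python
import numpy as np

# Toy 1-D dataset and bandwidth (both arbitrary choices for illustration).
x = np.array([0.0, 0.5, 3.0])
sigma = 1.0
dist = np.abs(x[:, None] - x[None, :])   # pairwise distances ||x_i - x_j||

# (a) sigma-neighborhood graph: 0/1 weights, no self-loops.
W_eps = (dist <= sigma).astype(float)
np.fill_diagonal(W_eps, 0.0)

# (b) fully connected graph with Gaussian weights.
W_gauss = np.exp(-dist ** 2 / (2 * sigma ** 2))

print(W_eps)              # only the pair (0.0, 0.5) is connected
print(W_gauss.round(3))   # all pairs connected; the distant pair's weight is near 0
```

Note how the Gaussian kernel keeps all edges but downweights the far-away point, while the σ-neighborhood graph removes that edge entirely.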

2. Step 2: Given the weight matrix, we obtain the graph Laplacian associated with W,
\[
L = D - W,
\]
or the normalized graph Laplacian
\[
\mathcal{L} = I_n - D^{-1/2} W D^{-1/2} = D^{-1/2} L D^{-1/2},
\]
where D is the (diagonal) degree matrix whose diagonal entries are D_{ii} = \sum_{j=1}^n w_{ij} > 0 and w_{ij} \ge 0.



3. Step 3: Compute the eigenvectors ϕ_1, · · · , ϕ_k corresponding to the k smallest eigenvalues of L or \mathcal{L}, i.e.,
\[
L ϕ_l = \lambda_l ϕ_l, \quad 1 \le l \le n,
\]
where \{ϕ_l\}_{l=1}^n is an orthonormal basis of \mathbb{R}^n and the eigenvalues are ordered as
\[
\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n.
\]
Let Φ ∈ \mathbb{R}^{n×k} be the matrix whose columns are \{ϕ_l\}_{l=1}^k, i.e., the first k eigenvectors of L, and write
\[
Φ(x_i) := (ϕ_1(x_i), \cdots, ϕ_k(x_i))^\top ∈ \mathbb{R}^k. \qquad (2.1)
\]
In other words, we transform each original data point x_i into a point in \mathbb{R}^k through the first k eigenvectors of L: this is a nonlinear transformation. The ith row of Φ represents the ith data point. This step is called the Laplacian eigenmap, and it is the key step in spectral clustering.

4. Step 4: rounding procedure. Apply k-means clustering to the rows of Φ to group the data into k clusters.
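The four steps above can be sketched end to end in numpy. The dataset (two concentric circles with radii 1 and 3) and the σ-neighborhood construction with σ = 0.5 are assumptions for illustration; with these choices each circle forms its own connected component, so the embedding separates the circles and the rounding step recovers them exactly.

```python
import numpy as np

# Two concentric circles (radii 1 and 3 assumed for illustration).
n = 80
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]
X = np.vstack([ring, 3 * ring])
rings = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]

# Step 1: sigma-neighborhood graph. sigma = 0.5 links consecutive points
# on each circle but never crosses the gap of width 2 between circles.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = (d2 <= 0.5 ** 2).astype(float)
np.fill_diagonal(W, 0.0)

# Step 2: graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# Step 3: Laplacian eigenmap with the k = 2 smallest eigenvalues
# (np.linalg.eigh returns eigenvalues in ascending order).
vals, vecs = np.linalg.eigh(L)
Phi = vecs[:, :2]

# Step 4: k-means on the rows of Phi. Farthest-point initialization
# keeps the two starting centers distinct.
centers = np.array([Phi[0], Phi[((Phi - Phi[0]) ** 2).sum(1).argmax()]])
for _ in range(20):
    labels = ((Phi[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    centers = np.array([Phi[labels == j].mean(axis=0) for j in range(2)])

exact = (labels == rings).all() or (labels == 1 - rings).all()
print("recovered both circles exactly:", exact)
```

Here the two zero eigenvalues come from the two connected components, so each row of Φ takes one of only two distinct values and k-means separates them trivially; Section 3 proves the facts this relies on.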

2.2 Numerical experiments

Now we apply spectral clustering (based on the graph Laplacian instead of the normalized Laplacian) to the three aforementioned datasets and see how it works. Here we first use a result (which will be proven later): the graph Laplacian is always positive semidefinite and its smallest eigenvalue is 0, with corresponding eigenvector 1_n. Therefore, when applying the Laplacian eigenmap (2.1), only the second smallest eigenvector ϕ_2 is used. Here we plot the second smallest eigenvector against the index of each node, as well as the clustering result.


Figure 4: Performance of spectral clustering applied to GMM




Figure 5: Performance of spectral clustering applied to the two concentric circles


Figure 6: Performance of spectral clustering applied to the anisotropic data

3 Graph Laplacian

The graph Laplacian plays an important role in spectral clustering. In fact, spectral clustering is an example of kernel k-means where the kernel is given by the eigenvectors of the graph Laplacian, a.k.a. the Laplacian eigenmap. The key question is: why does the Laplacian eigenmap work? To answer this, we first need to take a close look at the graph Laplacian. What is the graph Laplacian? In fact, the graph Laplacian is the core ingredient in spectral graph theory [1, 2], i.e., using the spectrum of the graph Laplacian to analyze the topological structure of graphs. It also has numerous applications in areas including probability theory, computer science, statistics, and machine learning.

Consider any weighted undirected graph with weight matrix W ∈ \mathbb{R}^{n×n}. The weight matrix need not be generated from data as we did in spectral clustering. In particular, if W ∈ \{0,1\}^{n×n} is symmetric, then it is an adjacency matrix: nodes i and j are connected if w_{ij} = 1 and disconnected if w_{ij} = 0. Any network can be represented in the form of an adjacency matrix.

The graph Laplacian is defined as
\[
L = D - W \qquad (3.1)
\]
where d_i = \sum_{j=1}^n w_{ij} is the ith diagonal entry of the degree matrix D and w_{ij} = w_{ji}. The degree matrix and Laplacian have the following explicit forms:
\[
D = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{bmatrix}, \qquad
L = \begin{bmatrix} d_1 - w_{11} & -w_{12} & \cdots & -w_{1n} \\ -w_{21} & d_2 - w_{22} & \cdots & -w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -w_{n1} & -w_{n2} & \cdots & d_n - w_{nn} \end{bmatrix}.
\]
The Laplacian matrix is also diagonally dominant, i.e., the absolute value of each diagonal entry is greater than or equal to the sum of the absolute values of the other entries in its row (or column).
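A quick numpy check of the explicit form: rows of L sum to zero, L is symmetric, and the diagonal carries the degrees. The 3-node weight matrix below is an arbitrary example.

```python
import numpy as np

# Arbitrary symmetric weight matrix on 3 nodes (zero diagonal).
W = np.array([[0.0, 2.0, 0.5],
              [2.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
d = W.sum(axis=1)        # degrees d_i = sum_j w_ij
L = np.diag(d) - W       # graph Laplacian L = D - W

print(L)
print(L.sum(axis=1))     # each row sums to 0: dominance holds with equality
```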

Now, we take a look at several important properties of the graph Laplacian.

(a) Graph Laplacian is always positive semidefinite.

Proof: it suffices to show that z^\top L z \ge 0 for any z ∈ \mathbb{R}^n. In fact, the quadratic form has a very simple expression:
\[
z^\top L z = \frac{1}{2} \sum_{i,j} w_{ij} (z_i - z_j)^2. \qquad (3.2)
\]
The proof follows from a straightforward calculation:
\begin{align*}
z^\top L z &= z^\top D z - z^\top W z \\
&= \sum_{i=1}^n d_i z_i^2 - \sum_{i,j} w_{ij} z_i z_j \\
&= \sum_{i=1}^n \sum_{j=1}^n w_{ij} z_i^2 - \sum_{i,j} w_{ij} z_i z_j \\
&= \sum_{i,j} \frac{w_{ij}}{2} (z_i^2 + z_j^2) - \sum_{i,j} w_{ij} z_i z_j \\
&= \frac{1}{2} \sum_{i,j} w_{ij} (z_i - z_j)^2 \ge 0
\end{align*}
since w_{ij} \ge 0. One can also prove this statement by using the Gershgorin circle theorem.
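The identity (3.2) and positive semidefiniteness are easy to spot-check numerically on a random weighted graph (the size and weights below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.uniform(0, 1, (n, n))
W = (W + W.T) / 2            # symmetric nonnegative weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

z = rng.standard_normal(n)
quad = z @ L @ z             # z^T L z
pairwise = 0.5 * sum(W[i, j] * (z[i] - z[j]) ** 2
                     for i in range(n) for j in range(n))

print(abs(quad - pairwise))            # ~0: the identity (3.2) holds
print(np.linalg.eigvalsh(L).min())     # >= 0 up to roundoff: L is PSD
```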

(b) The smallest eigenvalue λ_1(L) is 0, with eigenvector equal to the constant vector.

Proof: Note that z^\top L z = 0 if z = 1_n, by (3.2). This suggests that 1_n is the eigenvector w.r.t. the smallest eigenvalue. In fact, by the definition of D,
\[
L 1_n = (D - W) 1_n = D 1_n - W 1_n = d - d = 0
\]
where d = [d_1, \cdots, d_n]^\top.

This explains why only the second smallest eigenvector is needed when grouping the data into two clusters.

(c) The second smallest eigenvalue λ_2(L) is positive if and only if the graph is connected. The second smallest eigenvalue is called the Fiedler eigenvalue, and its corresponding eigenvector the Fiedler eigenvector.

A graph is connected if for any two vertices on the graph, there exists a path linking these two vertices such that each edge on this path has a positive edge weight.



Proof: “Connectivity ⇐= λ_2(L) > 0”. We prove the contrapositive: if the graph is not connected, then λ_2(L) = 0, i.e., the nullity of L is at least 2. Suppose the graph is disconnected; then the graph has (at least) two subgraphs with no edges between them. Subgraph i has a weight matrix W_i ∈ \mathbb{R}^{n_i×n_i}, where n_i is the number of vertices in subgraph i and n = n_1 + n_2. The weight matrix of the full graph must then be in block-diagonal form:
\[
W = \begin{bmatrix} W_1 & 0 \\ 0 & W_2 \end{bmatrix}, \qquad L = \begin{bmatrix} L_1 & 0 \\ 0 & L_2 \end{bmatrix}
\]
where L_i = D_i - W_i ∈ \mathbb{R}^{n_i×n_i} and D_i is the degree matrix of W_i.

Consider the following two vectors:
\[
z_1 = \begin{bmatrix} 1_{n_1} \\ 0 \end{bmatrix} ∈ \mathbb{R}^n, \qquad z_2 = \begin{bmatrix} 0 \\ 1_{n_2} \end{bmatrix} ∈ \mathbb{R}^n.
\]
Now left-multiply z_1 by L:
\[
L z_1 = \begin{bmatrix} L_1 & 0 \\ 0 & L_2 \end{bmatrix} \begin{bmatrix} 1_{n_1} \\ 0 \end{bmatrix} = \begin{bmatrix} L_1 1_{n_1} \\ 0 \end{bmatrix} = 0.
\]
The same argument gives L z_2 = 0. Therefore, both z_1 and z_2 are in the null space of L, i.e., λ_2(L) = 0.

“Connectivity =⇒ λ_2(L) > 0”. Suppose λ_2(L) = 0; then the null space is of dimension at least 2. We can find two vectors in the null space, 1_n and ϕ_2, with 1_n ⊥ ϕ_2. Why? For simplicity, we denote ϕ_2 by ϕ.

Since 1_n^\top ϕ = \sum_{i=1}^n ϕ_i = 0, there must exist both positive and negative entries in ϕ. Let Γ = \{i : ϕ(i) \ge 0\} and Γ^c = \{i : ϕ(i) < 0\}. This is a partition of the nodes [n]. Now, remember that ϕ is in the null space of L; thus
\[
ϕ^\top L ϕ = \frac{1}{2} \sum_{i,j} w_{ij} (ϕ_i - ϕ_j)^2 = 0.
\]

Decompose the summation into parts:
\begin{align*}
ϕ^\top L ϕ &= \frac{1}{2} \Big( \sum_{i ∈ Γ, j ∈ Γ} + \sum_{i ∈ Γ^c, j ∈ Γ} + \sum_{i ∈ Γ, j ∈ Γ^c} + \sum_{i ∈ Γ^c, j ∈ Γ^c} \Big) w_{ij} (ϕ_i - ϕ_j)^2 \\
&= \frac{1}{2} \Big( \sum_{i ∈ Γ, j ∈ Γ} + 2 \sum_{i ∈ Γ, j ∈ Γ^c} + \sum_{i ∈ Γ^c, j ∈ Γ^c} \Big) w_{ij} (ϕ_i - ϕ_j)^2 = 0.
\end{align*}
Note that each component is nonnegative but their sum is 0. This is equivalent to
\[
\sum_{i ∈ Γ, j ∈ Γ} w_{ij} (ϕ_i - ϕ_j)^2 = 0, \qquad \sum_{i ∈ Γ^c, j ∈ Γ^c} w_{ij} (ϕ_i - ϕ_j)^2 = 0,
\]
which forces ϕ_i = ϕ_j whenever i and j lie in the same part and w_{ij} > 0. Moreover, we have
\[
\sum_{i ∈ Γ, j ∈ Γ^c} w_{ij} (ϕ_i - ϕ_j)^2 = 0
\]



where ϕ_i - ϕ_j > 0 for all i ∈ Γ, j ∈ Γ^c, since ϕ_i \ge 0 > ϕ_j. Thus w_{ij} = 0 for all i ∈ Γ, j ∈ Γ^c. This means the weight matrix is of the form
\[
W = \begin{bmatrix} W_Γ & 0 \\ 0 & W_{Γ^c} \end{bmatrix},
\]
which is equivalent to the graph having two disjoint subgraphs, contradicting connectivity.
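Property (c) can be checked numerically: a triangle (connected) has a positive Fiedler eigenvalue, while two disjoint edges (a block-diagonal W, as in the proof) give λ_2 = 0. The two small graphs are arbitrary examples.

```python
import numpy as np

def lap(W):
    """Graph Laplacian L = D - W from a symmetric weight matrix."""
    return np.diag(W.sum(axis=1)) - W

# Connected graph: a triangle.
W_conn = np.array([[0., 1., 1.],
                   [1., 0., 1.],
                   [1., 1., 0.]])
lam_conn = np.linalg.eigvalsh(lap(W_conn))

# Disconnected graph: two disjoint edges (block-diagonal W).
W_disc = np.array([[0., 1., 0., 0.],
                   [1., 0., 0., 0.],
                   [0., 0., 0., 1.],
                   [0., 0., 1., 0.]])
lam_disc = np.linalg.eigvalsh(lap(W_disc))

print(lam_conn[1])   # positive: the Fiedler eigenvalue of a connected graph
print(lam_disc[1])   # ~0: the null space has dimension 2
```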

(d) The number of connected components in the graph is equal to the dimension of the null space of L. The spectrum of L is the union of the Laplacian spectra of each connected component. The proof is similar to (c); try to prove it by yourself.

Exercise: The operator norm of L is bounded by ‖L‖ ≤ 2‖D‖ = 2 max_{1≤i≤n} d_i, where ‖·‖ denotes the operator norm. Find an example where ‖L‖ = 2‖D‖. (Hint: the graph is bipartite.)

Exercise: Prove the Property (d).

3.1 Examples

Find the Laplacians of the following important graphs and their eigenvalues.

1. n-complete graph: each node is connected to all the other nodes.

The adjacency matrix and degree matrix are:
\[
A = \begin{bmatrix} 0 & 1 & 1 & \cdots & 1 \\ 1 & 0 & 1 & \cdots & 1 \\ 1 & 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 0 \end{bmatrix}, \qquad
D = \begin{bmatrix} n-1 & 0 & 0 & \cdots & 0 \\ 0 & n-1 & 0 & \cdots & 0 \\ 0 & 0 & n-1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & n-1 \end{bmatrix}.
\]

The Laplacian is
\[
L = \begin{bmatrix} n-1 & -1 & -1 & \cdots & -1 \\ -1 & n-1 & -1 & \cdots & -1 \\ -1 & -1 & n-1 & \cdots & -1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & -1 & \cdots & n-1 \end{bmatrix} = n I_n - 1_n 1_n^\top.
\]

Question: What are the eigenvalues and eigenvectors?

The eigenvalues are 0 with multiplicity 1 and n with multiplicity n - 1. The eigenvector for eigenvalue 0 is 1_n; any vector perpendicular to 1_n is an eigenvector w.r.t. eigenvalue n.
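A numerical check of this answer, with n = 6 as an arbitrary choice:

```python
import numpy as np

# Complete-graph Laplacian L = n*I_n - 1_n 1_n^T.
n = 6
L = n * np.eye(n) - np.ones((n, n))
vals = np.linalg.eigvalsh(L)   # eigenvalues in ascending order

print(vals)   # one eigenvalue 0, then n repeated n - 1 times
```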

2. n-path: the graph is simply a line with n nodes, denoted by {1, 2, 3, · · · , n}, asshown in Figure 7.

The adjacency matrix and degree matrix are:
\[
A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 1 & 0 & 1 & \cdots & 0 & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 & 1 & 0 \\ 0 & 0 & 0 & \cdots & 1 & 0 & 1 \\ 0 & 0 & 0 & \cdots & 0 & 1 & 0 \end{bmatrix}, \qquad
D = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 2 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 0 & 2 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 2 & 0 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 0 & 1 \end{bmatrix}.
\]




Figure 7: n-path

The Laplacian is

L = D −A =

1 −1 0 · · · 0 0 0−1 2 −1 · · · 0 0 00 −1 2 · · · 0 0 0...

......

. . ....

......

0 0 0 · · · 2 −1 00 0 0 · · · −1 2 −10 0 0 · · · 0 −1 1


Figure 8: Laplacian spectra of n-path

The Laplacian of the n-path graph is actually the discretization of the second-order differential operator on the unit interval with Neumann boundary conditions.

The eigenvalues are
\[
\lambda_k = 2 - 2 \cos\frac{\pi k}{n}, \quad 0 \le k \le n - 1.
\]
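The closed form can be verified against a numerical eigendecomposition (n = 20 is an arbitrary choice):

```python
import numpy as np

# n-path Laplacian built from its adjacency matrix.
n = 20
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A

computed = np.sort(np.linalg.eigvalsh(L))
closed_form = np.sort(2 - 2 * np.cos(np.pi * np.arange(n) / n))
print(np.abs(computed - closed_form).max())   # ~0: the formula matches
```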

3. The n-cycle is shown in Figure 9.




Figure 9: n-cycle

The adjacency matrix and degree matrix are:
\[
A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 & 0 & 1 \\ 1 & 0 & 1 & \cdots & 0 & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 & 1 & 0 \\ 0 & 0 & 0 & \cdots & 1 & 0 & 1 \\ 1 & 0 & 0 & \cdots & 0 & 1 & 0 \end{bmatrix}, \qquad D = 2 I_n.
\]

The Laplacian is
\[
L = \begin{bmatrix} 2 & -1 & 0 & \cdots & 0 & 0 & -1 \\ -1 & 2 & -1 & \cdots & 0 & 0 & 0 \\ 0 & -1 & 2 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 2 & -1 & 0 \\ 0 & 0 & 0 & \cdots & -1 & 2 & -1 \\ -1 & 0 & 0 & \cdots & 0 & -1 & 2 \end{bmatrix}.
\]

The Laplacian of the n-cycle graph is actually the discretization of the second-order differential operator on the unit interval with periodic boundary conditions.

Can we find an explicit form for the eigenvectors and eigenvalues?

(a) The eigenvalues are
\[
\lambda_k = 2 - 2 \cos\frac{2\pi k}{n}, \quad 0 \le k \le n - 1.
\]

(b) The corresponding eigenvectors are
\[
ϕ_k = \Big[ 1, \cos\frac{2\pi k}{n}, \cos\frac{4\pi k}{n}, \cdots, \cos\frac{2(n-1)\pi k}{n} \Big]^\top,
\]
i.e., ϕ_k(j) = \cos(2\pi j k / n) for j = 0, 1, \cdots, n - 1, where ϕ_k = ϕ_{n-k}.
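Both the eigenvalues and the cosine eigenvectors can be verified numerically (n = 12 and k = 3 are arbitrary choices):

```python
import numpy as np

# n-cycle Laplacian: path adjacency plus the wrap-around edge.
n = 12
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
A[0, -1] = A[-1, 0] = 1.0
L = np.diag(A.sum(axis=1)) - A

computed = np.sort(np.linalg.eigvalsh(L))
closed_form = np.sort(2 - 2 * np.cos(2 * np.pi * np.arange(n) / n))
print(np.abs(computed - closed_form).max())   # ~0: eigenvalues match

# Discrete cosine phi_k(j) = cos(2*pi*j*k/n) is an eigenvector.
k = 3
j = np.arange(n)
phi = np.cos(2 * np.pi * j * k / n)
lam = 2 - 2 * np.cos(2 * np.pi * k / n)
print(np.abs(L @ phi - lam * phi).max())      # ~0: L phi = lambda_k phi
```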

Exercise: Verify that the discrete sine and cosine vectors are eigenvectors of the graph Laplacian associated with the n-cycle.

Exercise: Find the spectrum of the Laplacian associated with the n-hypercube.




Figure 10: Laplacian spectra of n-cycle

3.2 Normalized graph Laplacian

The normalized graph Laplacian associated with the weight matrix W ∈ \mathbb{R}^{n×n} is defined as
\[
\mathcal{L} := I_n - D^{-1/2} W D^{-1/2} = D^{-1/2} L D^{-1/2} \qquad (3.3)
\]
where d_i = \sum_{j=1}^n w_{ij} > 0 is the ith diagonal entry of the degree matrix D and w_{ij} = w_{ji}.

Explicitly,
\[
\mathcal{L} = \begin{bmatrix} 1 - \frac{w_{11}}{d_1} & -\frac{w_{12}}{\sqrt{d_1 d_2}} & \cdots & -\frac{w_{1n}}{\sqrt{d_1 d_n}} \\ -\frac{w_{21}}{\sqrt{d_1 d_2}} & 1 - \frac{w_{22}}{d_2} & \cdots & -\frac{w_{2n}}{\sqrt{d_2 d_n}} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{w_{n1}}{\sqrt{d_1 d_n}} & -\frac{w_{n2}}{\sqrt{d_2 d_n}} & \cdots & 1 - \frac{w_{nn}}{d_n} \end{bmatrix} ∈ \mathbb{R}^{n×n}.
\]

Consider the following true-or-false questions:

(a) Is L positive semidefinite?

The answer is yes.

Proof: Let z be any vector in \mathbb{R}^n; then
\begin{align*}
z^\top \mathcal{L} z &= z^\top D^{-1/2} L D^{-1/2} z = (D^{-1/2} z)^\top L (D^{-1/2} z) \\
&= \frac{1}{2} \sum_{i,j=1}^n w_{ij} \big( (D^{-1/2} z)_i - (D^{-1/2} z)_j \big)^2 \\
&= \frac{1}{2} \sum_{i,j=1}^n w_{ij} \Big( \frac{z_i}{\sqrt{d_i}} - \frac{z_j}{\sqrt{d_j}} \Big)^2 \ge 0
\end{align*}
where (D^{-1/2} z)_i = z_i / \sqrt{d_i}.

(b) Is 1n in the null space of L?

The answer is no. Instead, D1/21n is in the null space of L.
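This is easy to confirm numerically on a random weighted graph (the size and weights below are arbitrary): \mathcal{L} D^{1/2} 1_n = D^{1/2} 1_n - D^{-1/2} W 1_n = D^{1/2} 1_n - D^{-1/2} d = 0, while \mathcal{L} 1_n is generally nonzero unless the graph is regular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.uniform(0.1, 1, (n, n))
W = (W + W.T) / 2            # symmetric positive weights
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1 / np.sqrt(d))
L_norm = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt

ones = np.ones(n)
print(np.abs(L_norm @ (np.sqrt(d) * ones)).max())   # ~0: D^{1/2} 1_n is a null vector
print(np.abs(L_norm @ ones).max())                  # nonzero: 1_n is not
```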

(c) The graph is connected if and only if λ2(L) > 0. True or false.

True. Prove it using the same argument as in the case of the graph Laplacian.



(d) The spectra of L and \mathcal{L} are the same.

False. Why?

Exercise: The operator norm of \mathcal{L} is bounded by 2. Find an example where ‖\mathcal{L}‖ = 2. What property does the corresponding graph have?

References

[1] F. R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, No. 92. American Mathematical Society, 1997.

[2] D. Spielman. Spectral Graph Theory. Lecture notes, 2019.

[3] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
