
Spectral Clustering (More General: Graph-based Structured Learning)

Yiming Yang

Based on the tutorial by Ulrike von Luxburg, Max Planck Institute for Biological Cybernetics


Outline

Motivation

Graph Notation

Properties of Graph Laplacians

Spectral Clustering Algorithms & Empirical Findings (Ng, NIPS 2002; Shi, IEEE 2000)

Semi-supervised Learning with Graph Laplacians (R Johnson & T Zhang, JMLR 2007; Hanxiao Liu, 2016)

Transfer Learning with Graph Laplacians (our recent work under AAAI review)


Motivation: Graph-based Discovery

Finding the clusters

Connecting the dots

Predicting class labels (transductive learning)

Propagating beliefs (in social networks)


Motivation: Graph Partitioning

Decomposing a computationally intractable graph into tractable subgraphs

Minimizing the connections among subgraphs;

Balancing the sizes of subgraphs.


Heterogeneous Networks


(from a slide by Jiawei Han)

Challenging Research Topics

Characterizing the (densities and) connectivity in a graph and its sub-graphs

Formulating the objective functions for the tasks of interest (clustering, classification, link prediction, etc.)

Algorithms for scalable optimization and robust prediction under noisy conditions

Generalization from single-graph to multi-graph based reasoning


Outline

Motivation

Graph Notation

Properties of Graph Laplacians

Spectral Clustering Algorithms & Empirical Findings (Ng, NIPS 2002; Shi, IEEE 2000)

Semi-supervised Learning with Graph Laplacians (R Johnson & T Zhang, JMLR 2007; Hanxiao Liu, 2016)

Transfer Learning with Graph Laplacians (our recent work under AAAI review)

Graph Notation

G = (V, E), an undirected graph;

V = {v_1, v_2, ..., v_n}, the vertex set;

W = (w_ij) with i, j ∈ {1, 2, ..., n} and w_ij ≥ 0, the weighted adjacency matrix;

D = diag(d_1, d_2, ..., d_n), the degree matrix with d_i = Σ_{j=1}^n w_ij;

L ≡ D − W, the (unnormalized) graph Laplacian;

L_sym ≡ D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}, the normalized symmetric graph Laplacian;

L_rw ≡ D^{-1} L = I − D^{-1} W, the normalized random-walk graph Laplacian.

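To make the definitions concrete, here is a minimal numpy sketch (an illustration added here, not from the lecture) that computes D, L, L_sym and L_rw for a small hypothetical W:

```python
import numpy as np

# Hypothetical symmetric weighted adjacency matrix of a small 4-node graph.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = W.sum(axis=1)                      # degrees d_i = sum_j w_ij
D = np.diag(d)                         # degree matrix

L = D - W                              # unnormalized graph Laplacian
D_isqrt = np.diag(1.0 / np.sqrt(d))
L_sym = D_isqrt @ L @ D_isqrt          # I - D^{-1/2} W D^{-1/2}
L_rw = np.diag(1.0 / d) @ L            # I - D^{-1} W
```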


Graph Notation (cont’d)

A ⊂ V, a vertex subset, and its complement Ā = V \ A;

𝟙_A = (f_1, f_2, ..., f_n)^T ∈ {0, 1}^n, the indicator vector with f_i = 1 if v_i ∈ A and f_i = 0 otherwise;

|A|, the number of vertices in A;

vol(A) = Σ_{i∈A} d_i, the total weight of the edges attached to the vertices in A;

A set A is a connected component if its vertices are connected within A and there is no connection between A and Ā;

Sets A_1, A_2, ..., A_k form a partition of graph G if A_i ∩ A_j = ∅ for i ≠ j and A_1 ∪ A_2 ∪ ... ∪ A_k = V.


Graph Construction

ε-neighborhood graph

Connecting all pairs of points whose distances are smaller than ε;

k-nearest-neighbor (kNN) graph

Connecting v_i to v_j if v_j is among the kNN of v_i, or only if each is among the kNN of the other (the mutual kNN graph);

Fully connected graph

Connecting all pairs of points with positive weights, e.g., w_ij = exp(−‖x_i − x_j‖² / (2σ²)).

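A sketch of these three construction rules (an added illustration, assuming numpy; the 2-D data points X are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                  # hypothetical 2-D data points

# Pairwise squared Euclidean distances.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# 1) epsilon-neighborhood graph: connect pairs closer than eps.
eps = 1.0
W_eps = ((sq < eps ** 2) & ~np.eye(len(X), dtype=bool)).astype(float)

# 2) kNN graph: connect each v_i to its k nearest neighbors, then symmetrize.
k = 3
nn = np.argsort(sq, axis=1)[:, 1:k + 1]       # skip self at column 0
W_knn = np.zeros_like(sq)
W_knn[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
W_knn = np.maximum(W_knn, W_knn.T)            # "or" rule; use np.minimum for mutual kNN

# 3) fully connected graph with Gaussian weights.
sigma = 1.0
W_full = np.exp(-sq / (2 * sigma ** 2))
np.fill_diagonal(W_full, 0.0)
```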


Graph Construction (cont’d)

Symmetric graph

L_sym ≡ D^{-1/2} L D^{-1/2} = D^{-1/2} (D − W) D^{-1/2} = I − D^{-1/2} W D^{-1/2}

W′ ≡ D^{-1/2} W D^{-1/2}, with w′_ij = w_ij / √(d_i d_j)

Random-walk graph

L_rw ≡ D^{-1} L = D^{-1} (D − W) = I − D^{-1} W

W′ = D^{-1} W, with w′_ij = w_ij / d_i = w_ij / Σ_{j′=1}^n w_ij′, so that Σ_{j=1}^n w′_ij = 1 (each row of W′ sums to 1)

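A quick self-contained check (an added illustration, assuming numpy) that the random-walk normalization makes each row of W′ sum to 1, i.e., W′ acts as a transition matrix:

```python
import numpy as np

W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
d = W.sum(axis=1)                          # degrees

W_rw = W / d[:, None]                      # w'_ij = w_ij / d_i
print(W_rw.sum(axis=1))                    # -> [1. 1. 1.], row-stochastic

W_sym = W / np.sqrt(np.outer(d, d))        # w'_ij = w_ij / sqrt(d_i d_j)
```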


Outline

Motivation

Graph Notation

Properties of Graph Laplacians

Spectral Clustering Algorithms & Empirical Findings (Ng, NIPS 2002; Shi, IEEE 2000)

Semi-supervised Learning with Graph Laplacians (Jerry Zhu 2005; Hanxiao Liu, 2016)

Transfer Learning with Graph Laplacians (our recent work under AAAI review)


Unnormalized Graph Laplacian (in an undirected G with non-negative weights)

L ≡ D − W, W ≥ 0

Property 1

∀f ∈ ℝ^n: f^T L f = (1/2) Σ_{i,j=1}^n w_ij (f_i − f_j)²

Proof: (page 3)

Intuition: Minimizing 𝑓𝑇𝐿𝑓 means that the connected nodes should have similar scores.

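A numerical sanity check of Property 1 (an added sketch, assuming numpy): f^T L f matches the weighted sum of squared score differences for a random f.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))
W = (A + A.T) / 2                  # hypothetical symmetric non-negative weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W     # L = D - W

f = rng.normal(size=n)
lhs = f @ L @ f
rhs = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
assert np.isclose(lhs, rhs)        # Property 1 holds numerically
```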

Unnormalized Graph Laplacian (cont’d)

Property 2. L = D − W is symmetric and positive semidefinite (PSD).

Why? Symmetry follows from the symmetry of D and W, and positive semidefiniteness follows from Property 1:


∀f ∈ ℝ^n: f^T L f = (1/2) Σ_{i,j=1}^n w_ij (f_i − f_j)² ≥ 0


Unnormalized Graph Laplacian (cont’d)

Property 3. The smallest eigenvalue of L is 0, and the corresponding eigenvector is 𝟙 (the n-by-1 vector with all elements equal to 1).

Property 4. L has n non-negative, real-valued eigenvalues 0 = λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n.

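Both properties are easy to confirm numerically; a small added sketch, assuming numpy:

```python
import numpy as np

W = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

evals, evecs = np.linalg.eigh(L)        # eigh: L is symmetric, evals ascending
assert np.all(evals >= -1e-10)          # Property 4: non-negative eigenvalues
assert np.isclose(evals[0], 0.0)        # Property 3: smallest eigenvalue is 0
assert np.allclose(L @ np.ones(3), 0.0) # Property 3: L applied to the 1-vector is 0
```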

Unnormalized Graph Laplacian (cont’d)

Note: The graph Laplacian L does not depend on the diagonal elements of W.

Self-edges in a graph do not change the graph Laplacian.

Any matrix U which coincides with W on all off-diagonal positions shares the same undirected graph Laplacian.

Spectral analysis of (the eigenvalues and eigenvectors of) the graph Laplacian L therefore focuses only on the links between different vertices.


Unnormalized Graph Laplacian (cont’d)

Proposition 2 (Number of connected components): For an undirected graph G with non-negative weights, the multiplicity k of the eigenvalue 0 of L equals the number of connected components A_1, A_2, ..., A_k, and the eigenspace of eigenvalue 0 is spanned by the indicator vectors 𝟙_{A_1}, 𝟙_{A_2}, ..., 𝟙_{A_k} of those components.

Proof.

Starting from k = 1 (a connected graph) …


Unnormalized Graph Laplacian (cont’d)

Proof for k > 1


(https://charlesmartin14.files.wordpress.com/2012/10/l1.png)
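An added illustration of Proposition 2 (assuming numpy): a hypothetical graph with two connected components has eigenvalue 0 with multiplicity 2, and the component indicator vectors lie in the null space of L.

```python
import numpy as np

W = np.zeros((5, 5))
W[0, 1] = W[1, 0] = 1.0                  # component A1 = {v0, v1}
W[2, 3] = W[3, 2] = 1.0                  # component A2 = {v2, v3, v4}
W[3, 4] = W[4, 3] = 1.0
L = np.diag(W.sum(axis=1)) - W

evals, _ = np.linalg.eigh(L)
print(np.sum(np.isclose(evals, 0.0)))    # -> 2, the number of components

one_A1 = np.array([1., 1., 0., 0., 0.])  # indicator vector of A1
one_A2 = np.array([0., 0., 1., 1., 1.])  # indicator vector of A2
assert np.allclose(L @ one_A1, 0.0) and np.allclose(L @ one_A2, 0.0)
```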


Unnormalized Graph Laplacian (cont’d)

Implications

1) If two nodes belong to the same connected component (subgraph), then they will have the identical score (of 1) in the indicator eigenvector corresponding to the 0 eigenvalue of L.

2) We can use the k eigenvectors that correspond to the eigenvalue of 0 to identify the connected subgraphs.

3) Specifically, letting the k eigenvectors of L (or L_sym or L_rw) be the column vectors of an n-by-k matrix V, we can run k-means over the row vectors of V to obtain the clusters of rows (nodes in the graph); see the sketch after this list.

4) Normalized graph Laplacians (L_sym and L_rw) have similar properties.

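A compact sketch of this procedure (an added illustration assuming numpy and scikit-learn's KMeans, not the exact algorithm of Ng et al. or Shi & Malik):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster the nodes of a weighted graph W into k groups."""
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    _, evecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    V = evecs[:, :k]                     # n-by-k matrix of the k bottom eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)  # k-means on the rows

# Two cliques joined by weak edges: the two clusters are recovered.
W = np.ones((6, 6)) - np.eye(6)
W[:3, 3:] = W[3:, :3] = 0.01
print(spectral_clustering(W, 2))         # e.g. [0 0 0 1 1 1] (labels may swap)
```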


(Figures from Ng et al., 2001: panels a–g compare the proposed spectral clustering against k-means on synthetic datasets.)

Note: the algorithm does not check whether the eigenvalues are equal to 0!


Concluding Remarks (w.r.t. Clustering)

Spectral clustering goes back to Donath and Hoffman, 1973.

Since then, it has been discovered, re-discovered and extended many times in different communities.

It became popular in ML around 2000 (Shi & Malik, Ng et al., Ding), with a huge number of papers since.

As opposed to k-means, it can solve very general problems like intertwined spirals.

Once the similarity graph is chosen, we just have to solve a linear problem (no issues of getting stuck in local minima).

However, choosing a good similarity graph is not trivial, and spectral clustering can be quite unstable under different choices of the parameters for the neighborhood graphs.


Concluding Remarks (w.r.t. Clustering)

Can we use W (or S) instead of L?

However, what about tasks other than clustering, e.g., belief propagation in transductive classification? If W and L share the same set of eigenvectors, would it make any difference to use W instead of L?


What about using L or W for tasks other than clustering?

For belief propagation in semi-supervised learning (SSL) of classification

For multi-graph inference based on spectral transformation of product graphs

Would it make any difference if we use W instead of L, or vice versa?

In both applications above, we focus on a set of k eigenvectors of L or W, but not necessarily the order among them.


Using L or W?

Do they share the same set of eigenvectors?

Do they share the same top-k (or bottom-k) eigenvectors?

Does the order of the eigenvalues matter in the prediction tasks (spectral clustering, classification, or ranking in link prediction)?


L = D − W

L_sym = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}

L_rw ≡ D^{-1} L = I − D^{-1} W

Case 1. L vs. W or P = D^{-1} W


L_rw v = λv  ⟺  (I − D^{-1} W) v = λv  ⟺  P v = (1 − λ) v

L_rw and P (not W) share the same set of eigenvectors, with eigenvalues related by μ = 1 − λ; the largest eigenvalue of P is 1, corresponding to the smallest eigenvalue 0 of L_rw.
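A numerical check of this claim (an added sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))
W = (A + A.T) / 2                     # hypothetical symmetric weights
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)

P = W / d[:, None]                    # random-walk transition matrix D^{-1} W
L_rw = np.eye(4) - P                  # normalized random-walk Laplacian

lam, V = np.linalg.eig(L_rw)          # L_rw is not symmetric: use eig
mu = 1.0 - lam                        # corresponding eigenvalues of P
for i in range(4):                    # same eigenvectors, shifted eigenvalues
    assert np.allclose(P @ V[:, i], mu[i] * V[:, i])
```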