
Spectral Clustering (More General: Graph-based Structured Learning)

Yiming Yang

Based on the tutorial by Ulrike von Luxburg, Max Planck Institute for Biological Cybernetics


Outline

Motivation

Graph Notation

Properties of Graph Laplacians

Spectral Clustering Algorithms & Empirical Findings (Ng, NIPS 2002; Shi, IEEE 2000)

Semi-supervised Learning with Graph Laplacians (R Johnson & T Zhang, JMLR 2007; Hanxiao Liu, 2016)

Transfer Learning with Graph Laplacians (our recent work under AAAI review)


Motivation: Graph-based Discovery

Finding the clusters

Connecting the dots

Predicting class labels (transductive learning)

Propagating beliefs (in social networks)


Motivation: Graph Partitioning

Decomposing a computationally intractable graph into tractable subgraphs

Minimizing the connections among subgraphs;

Balancing the sizes of subgraphs.


Heterogeneous Networks


(from a slide by Jiawei Han)

Challenging Research Topics

Characterizing the (densities and) connectivity in a graph and its sub-graphs

Formulating the objective functions for the tasks of interest (clustering, classification, link prediction, etc.)

Algorithms for scalable optimization and robust prediction under noisy conditions

Generalization from single-graph to multi-graph based reasoning


Outline

Motivation

Graph Notation

Properties of Graph Laplacians

Spectral Clustering Algorithms & Empirical Findings (Ng, NIPS 2002; Shi, IEEE 2000)

Semi-supervised Learning with Graph Laplacians (R Johnson & T Zhang, JMLR 2007; Hanxiao Liu, 2016)

Transfer Learning with Graph Laplacians (our recent work under AAAI review)

Graph Notation

G = (V, E), an undirected graph;

V = {v_1, v_2, ..., v_n}, the vertex set;

W = (w_ij) with i, j ∈ {1, 2, ..., n} and w_ij ≥ 0, the weighted adjacency matrix;

D = diag(d_1, d_2, ..., d_n), the degree matrix with d_i = Σ_{j=1}^n w_ij;

L ≡ D − W, the (unnormalized) graph Laplacian;

L_sym ≡ D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}, the normalized symmetric graph Laplacian;

L_rw ≡ D^{-1} L = I − D^{-1} W, the normalized random-walk graph Laplacian.

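To make the definitions concrete, here is a minimal numpy sketch (an illustration added here, not from the lecture) that computes D, L, L_sym and L_rw for a small hypothetical W:

```python
import numpy as np

# Hypothetical symmetric weighted adjacency matrix of a small 4-node graph.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = W.sum(axis=1)                      # degrees d_i = sum_j w_ij
D = np.diag(d)                         # degree matrix

L = D - W                              # unnormalized graph Laplacian
D_isqrt = np.diag(1.0 / np.sqrt(d))
L_sym = D_isqrt @ L @ D_isqrt          # I - D^{-1/2} W D^{-1/2}
L_rw = np.diag(1.0 / d) @ L            # I - D^{-1} W
```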


Graph Notation (cont’d)

A ⊂ V, a vertex subset, and its complement Ā = V \ A;

𝟙_A = (f_1, f_2, ..., f_n)^T ∈ {0, 1}^n, the indicator vector with f_i = 1 if v_i ∈ A and f_i = 0 otherwise;

|A|, the number of vertices in A;

vol(A) = Σ_{i∈A} d_i, the total weight of the edges attached to the vertices in A;

A set A is a connected component if its vertices are connected within A and there is no connection between A and Ā;

Sets A_1, A_2, ..., A_k form a partition of graph G if A_i ∩ A_j = ∅ for i ≠ j and A_1 ∪ A_2 ∪ ... ∪ A_k = V.


Graph Construction

ε-neighborhood graph

Connecting all pairs of points whose distances are smaller than ε;

k-nearest-neighbor (kNN) graph

Connecting v_i to v_j if v_j is among the kNN of v_i, or only if each is among the kNN of the other (the mutual kNN graph);

Fully connected graph

Connecting all pairs of points with positive weights, e.g., w_ij = exp(−‖x_i − x_j‖² / (2σ²)).

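A sketch of these three construction rules (an added illustration, assuming numpy; the 2-D data points X are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                  # hypothetical 2-D data points

# Pairwise squared Euclidean distances.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# 1) epsilon-neighborhood graph: connect pairs closer than eps.
eps = 1.0
W_eps = ((sq < eps ** 2) & ~np.eye(len(X), dtype=bool)).astype(float)

# 2) kNN graph: connect each v_i to its k nearest neighbors, then symmetrize.
k = 3
nn = np.argsort(sq, axis=1)[:, 1:k + 1]       # skip self at column 0
W_knn = np.zeros_like(sq)
W_knn[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
W_knn = np.maximum(W_knn, W_knn.T)            # "or" rule; use np.minimum for mutual kNN

# 3) fully connected graph with Gaussian weights.
sigma = 1.0
W_full = np.exp(-sq / (2 * sigma ** 2))
np.fill_diagonal(W_full, 0.0)
```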


Graph Construction (cont’d)

Symmetric graph

L_sym ≡ D^{-1/2} L D^{-1/2} = D^{-1/2} (D − W) D^{-1/2} = I − D^{-1/2} W D^{-1/2}

W′ ≡ D^{-1/2} W D^{-1/2}, with w′_ij = w_ij / √(d_i d_j)

Random-walk graph

L_rw ≡ D^{-1} L = D^{-1} (D − W) = I − D^{-1} W

W′ = D^{-1} W, with w′_ij = w_ij / d_i = w_ij / Σ_{j′=1}^n w_ij′, so that Σ_{j=1}^n w′_ij = 1 (each row of W′ sums to 1)

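A quick self-contained check (an added illustration, assuming numpy) that the random-walk normalization makes each row of W′ sum to 1, i.e., W′ acts as a transition matrix:

```python
import numpy as np

W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
d = W.sum(axis=1)                          # degrees

W_rw = W / d[:, None]                      # w'_ij = w_ij / d_i
print(W_rw.sum(axis=1))                    # -> [1. 1. 1.], row-stochastic

W_sym = W / np.sqrt(np.outer(d, d))        # w'_ij = w_ij / sqrt(d_i d_j)
```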


Outline

Motivation

Graph Notation

Properties of Graph Laplacians

Spectral Clustering Algorithms & Empirical Findings (Ng, NIPS 2002; Shi, IEEE 2000)

Semi-supervised Learning with Graph Laplacians (Jerry Zhu 2005; Hanxiao Liu, 2016)

Transfer Learning with Graph Laplacians (our recent work under AAAI review)


Unnormalized Graph Laplacian (in an undirected G with non-negative weights)

L ≡ D − W, W ≥ 0

Property 1

∀f ∈ ℝ^n: f^T L f = (1/2) Σ_{i,j=1}^n w_ij (f_i − f_j)²

Proof: (page 3)

Intuition: Minimizing 𝑓𝑇𝐿𝑓 means that the connected nodes should have similar scores.

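A numerical sanity check of Property 1 (an added sketch, assuming numpy): f^T L f matches the weighted sum of squared score differences for a random f.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))
W = (A + A.T) / 2                  # hypothetical symmetric non-negative weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W     # L = D - W

f = rng.normal(size=n)
lhs = f @ L @ f
rhs = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
assert np.isclose(lhs, rhs)        # Property 1 holds numerically
```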

Unnormalized Graph Laplacian (cont’d)

Property 2. L = D − W is symmetric and positive semidefinite (PSD).

Why? Symmetry follows from the symmetry of D and W, and positive semidefiniteness follows from Property 1:


∀f ∈ ℝ^n: f^T L f = (1/2) Σ_{i,j=1}^n w_ij (f_i − f_j)² ≥ 0


Unnormalized Graph Laplacian (cont’d)

Property 3. The smallest eigenvalue of L is 0, and the corresponding eigenvector is 𝟙 (the n-by-1 vector with all elements equal to 1).

Property 4. L has n non-negative, real-valued eigenvalues 0 = λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n.

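Both properties are easy to confirm numerically; a small added sketch, assuming numpy:

```python
import numpy as np

W = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

evals, evecs = np.linalg.eigh(L)        # eigh: L is symmetric, evals ascending
assert np.all(evals >= -1e-10)          # Property 4: non-negative eigenvalues
assert np.isclose(evals[0], 0.0)        # Property 3: smallest eigenvalue is 0
assert np.allclose(L @ np.ones(3), 0.0) # Property 3: L applied to the 1-vector is 0
```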

Unnormalized Graph Laplacian (cont’d)

Note: The graph Laplacian L does not depend on the diagonal elements of W.

Self-edges in a graph do not change the graph Laplacian.

Any matrix U which coincides with W on all off-diagonal positions shares the same undirected graph Laplacian.

Spectral analysis of (the eigenvalues and eigenvectors of) the graph Laplacian L therefore focuses only on the links between different vertices.


Unnormalized Graph Laplacian (cont’d)

Proposition 2 (Number of connected components): For an undirected graph G with non-negative weights, the multiplicity k of the eigenvalue 0 of L equals the number of connected components A_1, A_2, ..., A_k, and the eigenspace of eigenvalue 0 is spanned by the indicator vectors 𝟙_{A_1}, 𝟙_{A_2}, ..., 𝟙_{A_k} of those components.

Proof.

Starting from k = 1 (a connected graph) …


Unnormalized Graph Laplacian (cont’d)

Proof for k > 1


(https://charlesmartin14.files.wordpress.com/2012/10/l1.png)
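An added illustration of Proposition 2 (assuming numpy): a hypothetical graph with two connected components has eigenvalue 0 with multiplicity 2, and the component indicator vectors lie in the null space of L.

```python
import numpy as np

W = np.zeros((5, 5))
W[0, 1] = W[1, 0] = 1.0                  # component A1 = {v0, v1}
W[2, 3] = W[3, 2] = 1.0                  # component A2 = {v2, v3, v4}
W[3, 4] = W[4, 3] = 1.0
L = np.diag(W.sum(axis=1)) - W

evals, _ = np.linalg.eigh(L)
print(np.sum(np.isclose(evals, 0.0)))    # -> 2, the number of components

one_A1 = np.array([1., 1., 0., 0., 0.])  # indicator vector of A1
one_A2 = np.array([0., 0., 1., 1., 1.])  # indicator vector of A2
assert np.allclose(L @ one_A1, 0.0) and np.allclose(L @ one_A2, 0.0)
```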


Unnormalized Graph Laplacian (cont’d)

Implications

1) If two nodes belong to the same connected component (subgraph), then they will have the identical score (of 1) in the indicator eigenvector corresponding to the 0 eigenvalue of L.

2) We can use the k eigenvectors that correspond to the eigenvalue of 0 to identify the connected subgraphs.

3) Specifically, letting the k eigenvectors of L (or L_sym or L_rw) be the column vectors of an n-by-k matrix V, we can run k-means over the row vectors of V to obtain the clusters of rows (nodes in the graph); see the sketch after this list.

4) Normalized graph Laplacians (L_sym and L_rw) have similar properties.

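A compact sketch of this procedure (an added illustration assuming numpy and scikit-learn's KMeans, not the exact algorithm of Ng et al. or Shi & Malik):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster the nodes of a weighted graph W into k groups."""
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    _, evecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    V = evecs[:, :k]                     # n-by-k matrix of the k bottom eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)  # k-means on the rows

# Two cliques joined by weak edges: the two clusters are recovered.
W = np.ones((6, 6)) - np.eye(6)
W[:3, 3:] = W[3:, :3] = 0.01
print(spectral_clustering(W, 2))         # e.g. [0 0 0 1 1 1] (labels may swap)
```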


(Figures from Ng et al., 2001: panels a–g compare the proposed spectral clustering against k-means on synthetic datasets.)

Note: the algorithm does not check whether the eigenvalues are equal to 0!


Concluding Remarks (w.r.t. Clustering)

Spectral clustering goes back to Donath and Hoffman, 1973.

Since then, it has been discovered, re-discovered and extended many times in different communities.

It became popular in ML around 2000 (Shi & Malik, Ng et al., Ding), with a huge number of papers since.

As opposed to k-means, it can solve very general problems like intertwined spirals.

Once the similarity graph is chosen, we just have to solve a linear problem (no issues of getting stuck in local minima).

However, choosing a good similarity graph is not trivial, and spectral clustering can be quite unstable under different choices of the parameters for the neighborhood graphs.


Concluding Remarks (w.r.t. Clustering)

Can we use W (or S) instead of L?

However, what about tasks other than clustering, e.g., belief propagation in transductive classification? If W and L share the same set of eigenvectors, would it make any difference to use W instead of L?


What about using L or W for tasks other than clustering?

For belief propagation in semi-supervised learning (SSL) of classification

For multi-graph inference based on spectral transformation of product graphs

Would it make any difference if we use W instead of L, or vice versa?

In both applications above, we focus on a set of k eigenvectors of L or W, but not necessarily the order among them.


Using L or W?

Do they share the same set of eigenvectors?

Do they share the same top-k (or bottom-k) eigenvectors?

Does the order of the eigenvalues matter in the prediction tasks (spectral clustering, classification, or ranking in link prediction)?


L = D − W

L_sym = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}

L_rw ≡ D^{-1} L = I − D^{-1} W

Case 1. L vs. W or P = D^{-1} W


L_rw v = λv  ⟺  (I − D^{-1} W) v = λv  ⟺  P v = (1 − λ) v

L_rw and P (not W) share the same set of eigenvectors, with eigenvalues related by μ = 1 − λ; the largest eigenvalue of P is 1, corresponding to the smallest eigenvalue 0 of L_rw.
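A numerical check of this claim (an added sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))
W = (A + A.T) / 2                     # hypothetical symmetric weights
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)

P = W / d[:, None]                    # random-walk transition matrix D^{-1} W
L_rw = np.eye(4) - P                  # normalized random-walk Laplacian

lam, V = np.linalg.eig(L_rw)          # L_rw is not symmetric: use eig
mu = 1.0 - lam                        # corresponding eigenvalues of P
for i in range(4):                    # same eigenvectors, shifted eigenvalues
    assert np.allclose(P @ V[:, i], mu[i] * V[:, i])
```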