Learning Eigenfunctions: Links with
Spectral Clustering and Kernel PCA
Yoshua Bengio
Pascal Vincent
Jean-Francois Paiement
University of Montreal
Snowbird Learning Workshop, April 2, 2003
Learning Modal Structures of the Distribution
Manifold learning and clustering
= learning where the main high-density zones are.
Learning a transformation that reveals "clusters" and manifolds:
Cluster = zone of high density, separated from other clusters by regions of
low density.
Spectral Embedding Algorithms
Many learning algorithms, e.g.
- spectral clustering,
- kernel PCA,
- Local Linear Embedding (LLE),
- Isomap,
- Multi-Dimensional Scaling (MDS),
- Laplacian eigenmaps
have at their core the following (or its equivalent):
1. Start from $n$ data points $x_1, \dots, x_n$.
2. Construct an $n \times n$ "neighborhood" or similarity matrix $M$
   (with corresponding [possibly data-dependent] kernel $K(x,y)$).
3. Normalize it (and make it symmetric), yielding $\tilde M$
   (with corresponding kernel $\tilde K(x,y)$).
4. Compute the $m$ largest (equivalently, smallest) eigenvalues/eigenvectors.
5. Embedding of $x_i$ = $i$-th elements of each of the $m$ eigenvectors
   (possibly scaled using the eigenvalues).
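The five steps above can be sketched in a few lines of NumPy; the Gaussian kernel and the divisive normalization used below are one possible instantiation (assumed here for illustration), not the only one:

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, m=2):
    """Generic spectral embedding sketch: Gaussian similarity matrix,
    divisive (symmetric) normalization, top-m eigenvectors."""
    # Step 2: similarity matrix M from a Gaussian kernel
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    M = np.exp(-sq / (2 * sigma ** 2))
    # Step 3: normalize and symmetrize, M~ = D^{-1/2} M D^{-1/2}
    d = M.sum(axis=1)
    Mt = M / np.sqrt(np.outer(d, d))
    # Step 4: m largest eigenvalues/eigenvectors
    w, V = np.linalg.eigh(Mt)           # eigh returns ascending order
    order = np.argsort(w)[::-1][:m]
    # Step 5: embedding of x_i = i-th entries of the m leading eigenvectors
    return V[:, order]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
E = spectral_embedding(X)
```

The first embedding coordinate comes out proportional to $\sqrt{d}$, hence with constant sign; the later coordinates carry the cluster/manifold structure.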
Kernel PCA
Data $x_1, \dots, x_n$ are implicitly mapped to the "feature space" $\phi(x)$ of
kernel $K$, s.t.
$K(x,y) = \phi(x) \cdot \phi(y)$.
PCA is performed in feature space:
projecting points in high dimension may reveal a straight line along which they
are almost aligned (if the basis, i.e. the kernel, is "right").
Kernel PCA
Eigenvectors $w_k$ of the (generally infinite) covariance matrix
$C = \frac{1}{n} \sum_i \phi(x_i)\, \phi(x_i)'$ are
$w_k = \sum_i \alpha_{ki}\, \phi(x_i)$,
where $\alpha_k$ is an eigenvector of the Gram matrix $M_{ij} = K(x_i, x_j)$.
Projection on the $k$-th p.c. $= w_k \cdot \phi(x) = \sum_i \alpha_{ki}\, K(x_i, x)$.
N.B. need $\phi$ centered: $\sum_i \phi(x_i) = 0$, hence the subtractive normalization
$\tilde K(x,y) = K(x,y) - E_x[K(x,y)] - E_y[K(x,y)] + E_x[E_y[K(x,y)]]$
(Schölkopf 96)
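A minimal NumPy sketch of these steps: double-center the Gram matrix (the subtractive normalization), then project on its leading eigenvectors. The linear kernel at the end is only an assumption for illustration, chosen so the result can be checked against ordinary PCA:

```python
import numpy as np

def kernel_pca(K, m=2):
    """Kernel PCA sketch: subtractive normalization (double-centering)
    of the Gram matrix, then projection on the m leading components."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    # K~ = K - 1K - K1 + 1K1 (centers the implicit feature vectors)
    Kc = K - one @ K - K @ one + one @ K @ one
    w, V = np.linalg.eigh(Kc)
    order = np.argsort(w)[::-1][:m]
    lam, alpha = w[order], V[:, order]
    # projection of training point i on the k-th p.c. = sqrt(lam_k) * alpha_ik
    return alpha * np.sqrt(np.maximum(lam, 0.0))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
P = kernel_pca(X @ X.T)   # linear kernel: should match ordinary PCA
```

With a linear kernel the projections coincide, up to a sign per component, with ordinary PCA scores of the centered data, which is an easy sanity check.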
Laplacian Eigenmaps
- Gram matrix from the Laplace-Beltrami operator, which on finite
data (neighborhood graph) gives the graph Laplacian.
- Gaussian kernel, approximated by a k-NN adjacency matrix.
- Normalization: row sums minus the Gram matrix (graph Laplacian $L = D - M$).
- The Laplace-Beltrami operator $L$ is justified as a smoothness regularizer on
the manifold $\mathcal{M}$: $\int_{\mathcal{M}} \|\nabla f\|^2 = \langle Lf, f \rangle$, which equals the eigenvalue of
$L$ for (unit-norm) eigenfunctions $f$.
- Successfully used for semi-supervised learning.
(Belkin & Niyogi, 2002)
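A sketch of the finite-data object: Gaussian similarities restricted to a k-NN graph, and the resulting graph Laplacian $L = D - W$ (the kernel width and $k$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq / 2)                     # Gaussian similarities
k = 5
# keep each point's k nearest neighbors (excluding itself), symmetrized
mask = np.zeros_like(W, dtype=bool)
nn = np.argsort(sq, axis=1)[:, 1:k + 1]
mask[np.repeat(np.arange(len(X)), k), nn.ravel()] = True
mask |= mask.T
W = np.where(mask, W, 0.0)
# graph Laplacian L = D - W; Laplacian eigenmaps embeds with the
# eigenvectors of the smallest nonzero eigenvalues
L = np.diag(W.sum(axis=1)) - W
w, V = np.linalg.eigh(L)
```

Two defining properties hold by construction: the constant vector is in the null space of $L$, and $L$ is positive semi-definite (the discrete analogue of $\int \|\nabla f\|^2 \ge 0$).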
Spectral Clustering
- Normalize the kernel or Gram matrix divisively:
$\tilde K(x,y) = \frac{K(x,y)}{\sqrt{E_x[K(x,y)]\, E_y[K(x,y)]}}$,
i.e. $\tilde M = D^{-1/2} M D^{-1/2}$ with $D_{ii} = \sum_j M_{ij}$.
- Embedding of $x_i$ = $(v_{i1}, \dots, v_{im})$, where $v_k$ is the $k$-th eigenvector of the
normalized Gram matrix.
- Perform clustering on the embedded points (e.g. after normalizing them
by their norm).
Weiss, Ng, Jordan, ...
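A sketch of this recipe on two well-separated blobs; the Gaussian kernel width and the final sign-based assignment are simplifications assumed here (a real implementation would run k-means on the normalized embedding):

```python
import numpy as np

# Spectral clustering sketch: divisive normalization, embed with the two
# leading eigenvectors, normalize the embedded points by their norm,
# then assign clusters (by sign of the 2nd eigenvector, since the two
# blobs are well separated).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.2, (15, 2)), rng.normal(2, 0.2, (15, 2))])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
M = np.exp(-sq / 2)
d = M.sum(axis=1)
Mt = M / np.sqrt(np.outer(d, d))        # divisive normalization
w, V = np.linalg.eigh(Mt)
E = V[:, -2:]                           # two leading eigenvectors
E = E / np.linalg.norm(E, axis=1, keepdims=True)
labels = (E[:, 0] > 0).astype(int)      # E[:, 0] is the 2nd eigenvector
```

On near-block-diagonal similarity matrices the second eigenvector is approximately piecewise constant with opposite signs on the two blocks, which is why the sign suffices here.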
Spectral Clustering
[Figure: embedded points on the unit sphere]
The principal eigenfunctions approximate the kernel (= dot product) in the MSE sense:
$K(x_i, x_j)$ large $\Rightarrow$ embeddings of $x_i$ and $x_j$ almost collinear;
$K(x_i, x_j)$ small $\Rightarrow$ embeddings of $x_i$ and $x_j$ almost orthogonal;
$\Rightarrow$ points in the same cluster are mapped to points at a small angle, even for a
non-blob cluster (global constraint = transitivity of "nearness").
Density-Dependent Hilbert Space
Define a Hilbert space with a density-dependent inner product
$\langle f, g \rangle_p = \int f(x)\, g(x)\, p(x)\, dx$
with density $p(x)$.
A kernel function $K(x,y)$ defines a linear operator in that space:
$(K_p f)(x) = \int K(x,y)\, f(y)\, p(y)\, dy$
Eigenfunctions of a Kernel
Infinite-dimensional version of the eigenvectors of the Gram matrix:
$(K_p f_k)(x) = \int K(x,y)\, f_k(y)\, p(y)\, dy = \lambda_k f_k(x)$
(some conditions are needed to obtain a discrete spectrum).
Convergence of the
eigenvectors/eigenvalues of the Gram matrix of $n$ data points sampled from $p(x)$
to the
eigenfunctions/eigenvalues of the linear operator with the underlying $p(x)$
is proven as $n \to \infty$ (Williams & Seeger, 2000).
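This convergence is easy to observe numerically: the eigenvalues of $M/n$ stabilize as $n$ grows. The Gaussian kernel and standard normal data below are illustrative assumptions:

```python
import numpy as np

def top_eigenvalue(n, rng, sigma=1.0):
    """Largest eigenvalue of M/n: the empirical estimate of the largest
    eigenvalue of the kernel operator K_p."""
    x = rng.normal(size=n)              # data sampled from p = N(0, 1)
    M = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    return np.linalg.eigvalsh(M / n)[-1]

rng = np.random.default_rng(0)
small = top_eigenvalue(100, rng)
large = top_eigenvalue(1000, rng)
print(small, large)   # the two estimates are close: they stabilize as n grows
```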
Link between Spectral Clustering and Eigenfunctions
Equivalence between eigenvectors and eigenfunctions (and corresponding
eigenvalues) when $p(x)$ is the empirical distribution:
Proposition 1: If we choose for $p(x)$ the empirical distribution of the
data, then the spectral embedding from $\tilde M$ is equivalent to values of the
eigenfunctions of the normalized kernel $\tilde K$: $f_k(x_i) = \sqrt{n}\, v_{ik}$.
Proof: come and see our poster!
Link between Kernel PCA and Eigenfunctions
Proposition 2: If we choose for $p(x)$ the empirical distribution of the
data, then the kernel PCA projection is equivalent to scaled values of the
eigenfunctions of $\tilde K$: $P_k(x_i) = \sqrt{\lambda_k}\, f_k(x_i)$.
Proof: come and see our poster!
Consequence: up to the choice of kernel, kernel
normalization, and up to scaling by $\sqrt{\lambda_k}$, spectral
clustering, Laplacian eigenmaps and kernel PCA
give the same embedding. Isomap, MDS and LLE also
give eigenfunctions, but from a different type of
kernel.
From Embedding to General Mapping
- Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only
provide an embedding for the given data points.
- Natural generalization to new points: consider these algorithms as
learning eigenfunctions of $\tilde K$.
- The eigenfunctions $f_k$ provide a mapping for new points, e.g. for the
empirical $p(x)$:
$f_k(x) = \frac{1}{n \lambda_k} \sum_i f_k(x_i)\, \tilde K(x, x_i)$
- Data-dependent "kernels" (Isomap, LLE): need to compute $\tilde K(x, x_i)$
without changing $\tilde K(x_i, x_j)$. Reasonable for Isomap, less clear it makes
sense for LLE.
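The out-of-sample formula above is a Nyström-style extension. A sketch, with a plain Gaussian kernel and no normalization (both simplifying assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))            # training data
M = gaussian_kernel(X, X)               # Gram matrix (unnormalized here)
n = len(X)
ell, V = np.linalg.eigh(M)              # ascending eigenvalues
ell, V = ell[::-1], V[:, ::-1]          # descending

def f(k, Xnew):
    """f_k(x) = (1/(n*lam_k)) * sum_i f_k(x_i) K(x, x_i),
    with f_k(x_i) = sqrt(n) v_ik and lam_k = ell_k / n."""
    fk_train = np.sqrt(n) * V[:, k]
    Kx = gaussian_kernel(Xnew, X)
    return (Kx @ fk_train) / (n * (ell[k] / n))

# at the training points the formula recovers f_k(x_i) = sqrt(n) v_ik
print(np.allclose(f(0, X), np.sqrt(n) * V[:, 0]))  # True
```

Evaluated at a training point the formula reduces to the eigenvector equation, so it agrees with the embedding there; evaluated elsewhere it interpolates smoothly through the kernel.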
Criterion to Learn Eigenfunctions
Proposition 3: Given the first $m-1$ eigenfunctions $f_k$ of a symmetric
function $\tilde K(x,y)$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the
expected value of
$\left( \tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y) \right)^2$
over unit-norm $f$. Then we get $f = \pm f_m$ and $\lambda = \lambda_m$.
This helps to understand what the eigenfunctions are doing (approximating
the "dot product" $\tilde K(x,y)$) and provides a possible criterion for
estimating the eigenfunctions when $p(x)$ is not an empirical distribution.
Kernels such as the Gaussian kernel and nearest-neighbor related kernels
force the eigenfunctions to reconstruct $\tilde K(x,y)$ correctly only for nearby
objects: in high dimension, don't trust the Euclidean distance between far-away objects.
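Under the empirical distribution this criterion can be checked directly: the rank-$m$ eigenfunction expansion gives the best MSE reconstruction of the kernel, with residual error $\sum_{k>m} \lambda_k^2$. The un-normalized Gaussian Gram matrix below stands in for $\tilde K$ (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2)
n = len(X)

# empirical eigenfunctions: f_k(x_i) = sqrt(n) v_ik, lam_k = ell_k / n
ell, V = np.linalg.eigh(K)
ell, V = ell[::-1], V[:, ::-1]
lam, F = ell / n, np.sqrt(n) * V

def recon_error(m):
    """E[(K(x,y) - sum_{k<=m} lam_k f_k(x) f_k(y))^2] under the
    empirical distribution (mean over all pairs of training points)."""
    Km = (F[:, :m] * lam[:m]) @ F[:, :m].T
    return np.mean((K - Km) ** 2)

errs = [recon_error(m) for m in range(6)]
# errors decrease with m; the residual equals sum_{k>m} lam_k^2
```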
Using a Smooth Density to Define Eigenfunctions?
- Use your best estimator $\hat p(x)$ of the density of the data, instead of the
data themselves, for defining the eigenfunctions.
- A constrained class of eigenfunctions, e.g. neural networks, can force them to be
smooth and not necessarily local.
- Advantage? Better generalization away from the training points?
- Advantage? Better scaling with $n$? (no Gram matrix, no eigenvectors)
- Disadvantage? Optimization of the eigenfunctions may be more difficult?
Recovering the Density from the Eigenfunctions?
Visually the eigenfunctions appear to capture the main characteristics of
the density.
Can we obtain a better estimate of the density using the principal
eigenfunctions?
- (Girolami, 2001): truncating the orthogonal-series expansion of $p(x)$ in the
kernel eigenfunctions.
- Use ideas similar to (Teh & Roweis, 2003) and other mixtures of factor
analyzers: project back into input space, convolving with a
model of the reconstruction error as noise.
Role of Kernel Normalization?
Subtractive normalization leads to kernel PCA: center the feature map,
$\tilde\phi(x) = \phi(x) - E_x[\phi(x)]$.
Thus the corresponding kernel $\tilde K(x,y) = \tilde\phi(x) \cdot \tilde\phi(y)$ expands to:
$\tilde K(x,y) = K(x,y) - E_x[K(x,y)] - E_y[K(x,y)] + E_x[E_y[K(x,y)]]$
- the constant function is an eigenfunction
- eigenfunctions have zero mean and unit variance
- the double-centering normalization (MDS, Isomap) is as above
(based on the relation between dot products and distances)
- What can be said about the divisive normalization? It seems better at
clustering:
$\tilde K(x,y) = \frac{K(x,y)}{\sqrt{E_x[K(x,y)]\, E_y[K(x,y)]}}$
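The two normalizations have simple finite-sample signatures, sketched below: after double-centering, the constant vector is an eigenvector (with eigenvalue 0), while after divisive normalization $\sqrt{d}$ is an eigenvector with eigenvalue 1 (Gaussian kernel assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
M = np.exp(-sq / 2)
n = len(X)
one = np.full((n, n), 1.0 / n)

# subtractive (double-centering) normalization: the constant vector
# becomes an eigenvector with eigenvalue 0
Ms = M - one @ M - M @ one + one @ M @ one
print(np.allclose(Ms @ np.ones(n), 0))           # True

# divisive normalization: sqrt(d) is an eigenvector with eigenvalue 1
d = M.sum(axis=1)
Md = M / np.sqrt(np.outer(d, d))
print(np.allclose(Md @ np.sqrt(d), np.sqrt(d)))  # True
```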
Multi-layer Learning of Similarity and Density?
The learned eigenfunctions capture salient features of the distribution:
abstractions such as clusters and manifolds.
Old AI (and connectionist) idea: build high-level abstractions on top of
lower-level abstractions.
[Diagram: local Euclidean similarity -> farther-reaching notion of similarity;
empirical density -> improved density model]
Density-Adjusted Similarity and Kernel
[Figure: three points A, B, C]
Want A and B "closer" than B and C.
Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric,
with a metric tensor that penalizes low density.
SEE OTHER POSTER (Vincent & Bengio)
Density-Adjusted Similarity and Kernel
[Figures: original spirals data; Gaussian kernel spectral embedding;
two density-adjusted embeddings]
Conclusions
- Many unsupervised learning algorithms (kernel PCA, spectral
clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they
compute eigenfunctions of a normalized kernel.
- The embedding can be generalized to a mapping applicable to new points.
- The eigenfunctions seem to capture salient features of the distribution by
minimizing the kernel reconstruction error.
- Many questions remain open:
  - eigenfunctions -> recovering an explicit density function?
  - finding eigenfunctions with a smooth $p(x)$?
  - meaning of the various kernel normalizations?
  - multi-layer learning?
  - density-adjusted similarity (see the Vincent & Bengio poster).
Proposition 3
The principal eigenfunction of the linear operator
$(K_p f)(x) = \int K(x,y)\, f(y)\, p(y)\, dy$
corresponding to kernel $K$ is the (or a, if there are repeated eigenvalues) norm-1
function $f$ that minimizes the reconstruction error
$J(f, \lambda) = E_{x,y}\left[ \left( K(x,y) - \lambda f(x) f(y) \right)^2 \right]$
Proof of Proposition 1
Proposition 1: If we choose for $p(x)$ the empirical distribution of the
data, then the spectral embedding from $\tilde M$ is equivalent to values of the
eigenfunctions of the normalized kernel $\tilde K$: $f_k(x_i) = \sqrt{n}\, v_{ik}$.
(Simplified) proof:
As shown in Proposition 3, finding the function $f$ and scalar $\lambda$ minimizing
$E_{x,y}\left[ \left( \tilde K(x,y) - \lambda f(x) f(y) \right)^2 \right]$
s.t. $\|f\| = 1$ yields a solution that satisfies
$(\tilde K_p f)(x) = \lambda f(x)$
with $\lambda$ the (possibly repeated) maximum-norm eigenvalue.
Proof of Proposition 1
With the empirical $p(x)$, the above becomes (at $x = x_i$):
$\frac{1}{n} \sum_j \tilde K(x_i, x_j)\, f(x_j) = \lambda f(x_i)$
Write $u_i = f(x_i)$ and $\tilde M_{ij} = \tilde K(x_i, x_j)$; then
$\tilde M u = n \lambda\, u$
and, since $\|f\| = 1$ means $\frac{1}{n} \sum_i f(x_i)^2 = 1$, i.e. $\|u\| = \sqrt{n}$, we obtain
for the principal eigenvector: $v_1 = u / \sqrt{n}$, i.e. $f_1(x_i) = \sqrt{n}\, v_{i1}$.
For the other eigenvalues, consider the "residual kernel"
$\tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y)$ and recursively apply the
same reasoning to obtain $f_2(x_i) = \sqrt{n}\, v_{i2}$, $f_3(x_i) = \sqrt{n}\, v_{i3}$, etc.
Q.E.D.
Proof of Proposition 2
Proposition 2: If we choose for $p(x)$ the empirical distribution of the
data, then the kernel PCA projection is equivalent to scaled values of the
eigenfunctions of $\tilde K$: $P_k(x_i) = \sqrt{\lambda_k}\, f_k(x_i)$.
(Simplified) proof:
Kernel PCA diagonalizes the feature-space covariance matrix
$C = \frac{1}{n} \sum_i \tilde\phi(x_i)\, \tilde\phi(x_i)'$
whose eigenvectors can be written $w_k = \sum_i \alpha_{ki}\, \tilde\phi(x_i)$, where
$\alpha_k = v_k / \sqrt{\ell_k}$ and $v_k$ is also the $k$-th eigenvector of the Gram matrix
$\tilde M$ (with eigenvalue $\ell_k$).
The projection of $x$ on the $k$-th principal component is therefore
$P_k(x) = w_k \cdot \tilde\phi(x) = \sum_i \alpha_{ki}\, \tilde K(x_i, x)
= \frac{1}{\sqrt{\ell_k}} \sum_i v_{ik}\, \tilde K(x_i, x).$
By Proposition 1, the eigenfunctions of $\tilde K$ with the empirical $p(x)$ satisfy
$f_k(x) = \frac{\sqrt{n}}{\ell_k} \sum_i v_{ik}\, \tilde K(x_i, x),
\qquad \lambda_k = \frac{\ell_k}{n},$
so that, where $\tilde K$ takes its values,
$P_k(x) = \frac{\ell_k}{\sqrt{\ell_k}\, \sqrt{n}}\, f_k(x) = \sqrt{\lambda_k}\, f_k(x).$
Q.E.D.
Proof of Proposition 3
Proposition 3: Given the first $m-1$ eigenfunctions $f_k$ of a symmetric
function $\tilde K(x,y)$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the
expected value of
$\left( \tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y) \right)^2$
over unit-norm $f$. Then we get $f = \pm f_m$ and $\lambda = \lambda_m$.
Proof:
The reconstruction error using the approximation
$\tilde K(x,y) \approx \sum_{k<m} \lambda_k f_k(x) f_k(y) + \lambda f(x) f(y)$ is
$J(f, \lambda) = E_{x,y}\left[ \left( R(x,y) - \lambda f(x) f(y) \right)^2 \right],
\qquad R(x,y) = \tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y),$
where the $(f_k, \lambda_k)$ are the first $m-1$ (eigenfunction, eigenvalue) pairs in
order of decreasing absolute value of $\lambda_k$.
Minimization of $J$ w.r.t. $\lambda$ gives (using $\|f\| = 1$)
$\lambda = E_{x,y}\left[ R(x,y)\, f(x)\, f(y) \right] \qquad (1)$
and, substituting back,
$J = E_{x,y}\left[ R(x,y)^2 \right] - \lambda^2$
using eq. (1). So $\lambda^2$ should be maximized.
Setting the functional derivative of $\lambda^2$ w.r.t. $f$, under the constraint
$\|f\| = 1$, equal to zero yields
$(R_p f)(x) = \int R(x,y)\, f(y)\, p(y)\, dy = \mu f(x) \qquad (2)$
for a Lagrange multiplier $\mu$: $f$ must be an eigenfunction of the residual
operator.
Expand $f$ in the eigenfunctions of $\tilde K$: $f = \sum_j w_j f_j$ with
$\sum_j w_j^2 = 1$ (by Parseval's theorem). Using the orthogonality of the $f_k$
($k < m$), the residual operator removes the first $m-1$ components:
$(R_p f)(x) = \sum_{j \ge m} \lambda_j w_j f_j(x),$
so eq. (1) gives $\lambda = \sum_{j \ge m} \lambda_j w_j^2$, hence $|\lambda| \le |\lambda_m|$.
If the $\lambda_j$ are distinct, $\lambda^2$ is maximized when $w_m^2 = 1$ and $w_j = 0$ for
$j \ne m$, i.e. $f = \pm f_m$ and $\lambda = \lambda_m$.
Q.E.D.