Learning Eigenfunctions: Links with
Spectral Clustering and Kernel PCA
Yoshua Bengio
Pascal Vincent
Jean-Francois Paiement
University of Montreal
Snowbird Learning Workshop, April 2, 2003
Learning Modal Structures of the Distribution
Manifold learning and clustering
= learning where the main high-density zones are.
Learning a transformation that reveals "clusters" and manifolds:
Cluster = zone of high density, separated from other clusters by regions of
low density.
Spectral Embedding Algorithms
Many learning algorithms, e.g.
- spectral clustering,
- kernel PCA,
- Local Linear Embedding (LLE),
- Isomap,
- Multi-Dimensional Scaling (MDS),
- Laplacian eigenmaps
have at their core the following (or its equivalent):
1. Start from $n$ data points $x_1, \dots, x_n$.
2. Construct an $n \times n$ "neighborhood" or similarity matrix $M$
   (with corresponding [possibly data-dependent] kernel $K(x,y)$).
3. Normalize it (and make it symmetric), yielding $\tilde M$
   (with corresponding kernel $\tilde K(x,y)$).
4. Compute the $m$ largest (equivalently, smallest) eigenvalues/eigenvectors.
5. Embedding of $x_i$ = $i$-th elements of each of the $m$ eigenvectors
   (possibly scaled using the eigenvalues).
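The five steps above can be sketched in a few lines of NumPy; the Gaussian kernel and the divisive normalization used below are one possible instantiation (assumed here for illustration), not the only one:

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, m=2):
    """Generic spectral embedding sketch: Gaussian similarity matrix,
    divisive (symmetric) normalization, top-m eigenvectors."""
    # Step 2: similarity matrix M from a Gaussian kernel
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    M = np.exp(-sq / (2 * sigma ** 2))
    # Step 3: normalize and symmetrize, M~ = D^{-1/2} M D^{-1/2}
    d = M.sum(axis=1)
    Mt = M / np.sqrt(np.outer(d, d))
    # Step 4: m largest eigenvalues/eigenvectors
    w, V = np.linalg.eigh(Mt)           # eigh returns ascending order
    order = np.argsort(w)[::-1][:m]
    # Step 5: embedding of x_i = i-th entries of the m leading eigenvectors
    return V[:, order]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
E = spectral_embedding(X)
```

The first embedding coordinate comes out proportional to $\sqrt{d}$, hence with constant sign; the later coordinates carry the cluster/manifold structure.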
Kernel PCA
Data $x_1, \dots, x_n$ are implicitly mapped to the "feature space" $\phi(x)$ of
kernel $K$, s.t.
$K(x,y) = \phi(x) \cdot \phi(y)$.
PCA is performed in feature space:
projecting points in high dimension may reveal a straight line along which they
are almost aligned (if the basis, i.e. the kernel, is "right").
Kernel PCA
Eigenvectors $w_k$ of the (generally infinite) covariance matrix
$C = \frac{1}{n} \sum_i \phi(x_i)\, \phi(x_i)'$ are
$w_k = \sum_i \alpha_{ki}\, \phi(x_i)$,
where $\alpha_k$ is an eigenvector of the Gram matrix $M_{ij} = K(x_i, x_j)$.
Projection on the $k$-th p.c. $= w_k \cdot \phi(x) = \sum_i \alpha_{ki}\, K(x_i, x)$.
N.B. need $\phi$ centered: $\sum_i \phi(x_i) = 0$, hence the subtractive normalization
$\tilde K(x,y) = K(x,y) - E_x[K(x,y)] - E_y[K(x,y)] + E_x[E_y[K(x,y)]]$
(Schölkopf 96)
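A minimal NumPy sketch of these steps: double-center the Gram matrix (the subtractive normalization), then project on its leading eigenvectors. The linear kernel at the end is only an assumption for illustration, chosen so the result can be checked against ordinary PCA:

```python
import numpy as np

def kernel_pca(K, m=2):
    """Kernel PCA sketch: subtractive normalization (double-centering)
    of the Gram matrix, then projection on the m leading components."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    # K~ = K - 1K - K1 + 1K1 (centers the implicit feature vectors)
    Kc = K - one @ K - K @ one + one @ K @ one
    w, V = np.linalg.eigh(Kc)
    order = np.argsort(w)[::-1][:m]
    lam, alpha = w[order], V[:, order]
    # projection of training point i on the k-th p.c. = sqrt(lam_k) * alpha_ik
    return alpha * np.sqrt(np.maximum(lam, 0.0))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
P = kernel_pca(X @ X.T)   # linear kernel: should match ordinary PCA
```

With a linear kernel the projections coincide, up to a sign per component, with ordinary PCA scores of the centered data, which is an easy sanity check.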
Laplacian Eigenmaps
- Gram matrix from the Laplace-Beltrami operator, which on finite
data (neighborhood graph) gives the graph Laplacian.
- Gaussian kernel, approximated by a k-NN adjacency matrix.
- Normalization: row sums minus the Gram matrix (graph Laplacian $L = D - M$).
- The Laplace-Beltrami operator $L$ is justified as a smoothness regularizer on
the manifold $\mathcal{M}$: $\int_{\mathcal{M}} \|\nabla f\|^2 = \langle Lf, f \rangle$, which equals the eigenvalue of
$L$ for (unit-norm) eigenfunctions $f$.
- Successfully used for semi-supervised learning.
(Belkin & Niyogi, 2002)
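A sketch of the finite-data object: Gaussian similarities restricted to a k-NN graph, and the resulting graph Laplacian $L = D - W$ (the kernel width and $k$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq / 2)                     # Gaussian similarities
k = 5
# keep each point's k nearest neighbors (excluding itself), symmetrized
mask = np.zeros_like(W, dtype=bool)
nn = np.argsort(sq, axis=1)[:, 1:k + 1]
mask[np.repeat(np.arange(len(X)), k), nn.ravel()] = True
mask |= mask.T
W = np.where(mask, W, 0.0)
# graph Laplacian L = D - W; Laplacian eigenmaps embeds with the
# eigenvectors of the smallest nonzero eigenvalues
L = np.diag(W.sum(axis=1)) - W
w, V = np.linalg.eigh(L)
```

Two defining properties hold by construction: the constant vector is in the null space of $L$, and $L$ is positive semi-definite (the discrete analogue of $\int \|\nabla f\|^2 \ge 0$).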
Spectral Clustering
- Normalize the kernel or Gram matrix divisively:
$\tilde K(x,y) = \frac{K(x,y)}{\sqrt{E_x[K(x,y)]\, E_y[K(x,y)]}}$,
i.e. $\tilde M = D^{-1/2} M D^{-1/2}$ with $D_{ii} = \sum_j M_{ij}$.
- Embedding of $x_i$ = $(v_{i1}, \dots, v_{im})$, where $v_k$ is the $k$-th eigenvector of the
normalized Gram matrix.
- Perform clustering on the embedded points (e.g. after normalizing them
by their norm).
Weiss, Ng, Jordan, ...
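A sketch of this recipe on two well-separated blobs; the Gaussian kernel width and the final sign-based assignment are simplifications assumed here (a real implementation would run k-means on the normalized embedding):

```python
import numpy as np

# Spectral clustering sketch: divisive normalization, embed with the two
# leading eigenvectors, normalize the embedded points by their norm,
# then assign clusters (by sign of the 2nd eigenvector, since the two
# blobs are well separated).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.2, (15, 2)), rng.normal(2, 0.2, (15, 2))])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
M = np.exp(-sq / 2)
d = M.sum(axis=1)
Mt = M / np.sqrt(np.outer(d, d))        # divisive normalization
w, V = np.linalg.eigh(Mt)
E = V[:, -2:]                           # two leading eigenvectors
E = E / np.linalg.norm(E, axis=1, keepdims=True)
labels = (E[:, 0] > 0).astype(int)      # E[:, 0] is the 2nd eigenvector
```

On near-block-diagonal similarity matrices the second eigenvector is approximately piecewise constant with opposite signs on the two blocks, which is why the sign suffices here.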
Spectral Clustering
[Figure: embedded points on the unit sphere]
The principal eigenfunctions approximate the kernel (= dot product) in the MSE sense:
$K(x_i, x_j)$ large $\Rightarrow$ embeddings of $x_i$ and $x_j$ almost collinear;
$K(x_i, x_j)$ small $\Rightarrow$ embeddings of $x_i$ and $x_j$ almost orthogonal;
$\Rightarrow$ points in the same cluster are mapped to points at a small angle, even for a
non-blob cluster (global constraint = transitivity of "nearness").
Density-Dependent Hilbert Space
Define a Hilbert space with a density-dependent inner product
$\langle f, g \rangle_p = \int f(x)\, g(x)\, p(x)\, dx$
with density $p(x)$.
A kernel function $K(x,y)$ defines a linear operator in that space:
$(K_p f)(x) = \int K(x,y)\, f(y)\, p(y)\, dy$
Eigenfunctions of a Kernel
Infinite-dimensional version of the eigenvectors of the Gram matrix:
$(K_p f_k)(x) = \int K(x,y)\, f_k(y)\, p(y)\, dy = \lambda_k f_k(x)$
(some conditions are needed to obtain a discrete spectrum).
Convergence of the
eigenvectors/eigenvalues of the Gram matrix of $n$ data points sampled from $p(x)$
to the
eigenfunctions/eigenvalues of the linear operator with the underlying $p(x)$
is proven as $n \to \infty$ (Williams & Seeger, 2000).
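This convergence is easy to observe numerically: the eigenvalues of $M/n$ stabilize as $n$ grows. The Gaussian kernel and standard normal data below are illustrative assumptions:

```python
import numpy as np

def top_eigenvalue(n, rng, sigma=1.0):
    """Largest eigenvalue of M/n: the empirical estimate of the largest
    eigenvalue of the kernel operator K_p."""
    x = rng.normal(size=n)              # data sampled from p = N(0, 1)
    M = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    return np.linalg.eigvalsh(M / n)[-1]

rng = np.random.default_rng(0)
small = top_eigenvalue(100, rng)
large = top_eigenvalue(1000, rng)
print(small, large)   # the two estimates are close: they stabilize as n grows
```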
Link between Spectral Clustering and Eigenfunctions
Equivalence between eigenvectors and eigenfunctions (and corresponding
eigenvalues) when $p(x)$ is the empirical distribution:
Proposition 1: If we choose for $p(x)$ the empirical distribution of the
data, then the spectral embedding from $\tilde M$ is equivalent to values of the
eigenfunctions of the normalized kernel $\tilde K$: $f_k(x_i) = \sqrt{n}\, v_{ik}$.
Proof: come and see our poster!
Link between Kernel PCA and Eigenfunctions
Proposition 2: If we choose for $p(x)$ the empirical distribution of the
data, then the kernel PCA projection is equivalent to scaled values of the
eigenfunctions of $\tilde K$: $P_k(x_i) = \sqrt{\lambda_k}\, f_k(x_i)$.
Proof: come and see our poster!
Consequence: up to the choice of kernel, kernel
normalization, and up to scaling by $\sqrt{\lambda_k}$, spectral
clustering, Laplacian eigenmaps and kernel PCA
give the same embedding. Isomap, MDS and LLE also
give eigenfunctions, but from a different type of
kernel.
From Embedding to General Mapping
- Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only
provide an embedding for the given data points.
- Natural generalization to new points: consider these algorithms as
learning eigenfunctions of $\tilde K$.
- The eigenfunctions $f_k$ provide a mapping for new points, e.g. for the
empirical $p(x)$:
$f_k(x) = \frac{1}{n \lambda_k} \sum_i f_k(x_i)\, \tilde K(x, x_i)$
- Data-dependent "kernels" (Isomap, LLE): need to compute $\tilde K(x, x_i)$
without changing $\tilde K(x_i, x_j)$. Reasonable for Isomap, less clear it makes
sense for LLE.
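The out-of-sample formula above is a Nyström-style extension. A sketch, with a plain Gaussian kernel and no normalization (both simplifying assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))            # training data
M = gaussian_kernel(X, X)               # Gram matrix (unnormalized here)
n = len(X)
ell, V = np.linalg.eigh(M)              # ascending eigenvalues
ell, V = ell[::-1], V[:, ::-1]          # descending

def f(k, Xnew):
    """f_k(x) = (1/(n*lam_k)) * sum_i f_k(x_i) K(x, x_i),
    with f_k(x_i) = sqrt(n) v_ik and lam_k = ell_k / n."""
    fk_train = np.sqrt(n) * V[:, k]
    Kx = gaussian_kernel(Xnew, X)
    return (Kx @ fk_train) / (n * (ell[k] / n))

# at the training points the formula recovers f_k(x_i) = sqrt(n) v_ik
print(np.allclose(f(0, X), np.sqrt(n) * V[:, 0]))  # True
```

Evaluated at a training point the formula reduces to the eigenvector equation, so it agrees with the embedding there; evaluated elsewhere it interpolates smoothly through the kernel.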
Criterion to Learn Eigenfunctions
Proposition 3: Given the first $m-1$ eigenfunctions $f_k$ of a symmetric
function $\tilde K(x,y)$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the
expected value of
$\left( \tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y) \right)^2$
over unit-norm $f$. Then we get $f = \pm f_m$ and $\lambda = \lambda_m$.
This helps to understand what the eigenfunctions are doing (approximating
the "dot product" $\tilde K(x,y)$) and provides a possible criterion for
estimating the eigenfunctions when $p(x)$ is not an empirical distribution.
Kernels such as the Gaussian kernel and nearest-neighbor related kernels
force the eigenfunctions to reconstruct $\tilde K(x,y)$ correctly only for nearby
objects: in high dimension, don't trust the Euclidean distance between far-away objects.
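Under the empirical distribution this criterion can be checked directly: the rank-$m$ eigenfunction expansion gives the best MSE reconstruction of the kernel, with residual error $\sum_{k>m} \lambda_k^2$. The un-normalized Gaussian Gram matrix below stands in for $\tilde K$ (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2)
n = len(X)

# empirical eigenfunctions: f_k(x_i) = sqrt(n) v_ik, lam_k = ell_k / n
ell, V = np.linalg.eigh(K)
ell, V = ell[::-1], V[:, ::-1]
lam, F = ell / n, np.sqrt(n) * V

def recon_error(m):
    """E[(K(x,y) - sum_{k<=m} lam_k f_k(x) f_k(y))^2] under the
    empirical distribution (mean over all pairs of training points)."""
    Km = (F[:, :m] * lam[:m]) @ F[:, :m].T
    return np.mean((K - Km) ** 2)

errs = [recon_error(m) for m in range(6)]
# errors decrease with m; the residual equals sum_{k>m} lam_k^2
```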
Using a Smooth Density to Define Eigenfunctions?
- Use your best estimator $\hat p(x)$ of the density of the data, instead of the
data themselves, for defining the eigenfunctions.
- A constrained class of eigenfunctions, e.g. neural networks, can force them to be
smooth and not necessarily local.
- Advantage? Better generalization away from the training points?
- Advantage? Better scaling with $n$? (no Gram matrix, no eigenvectors)
- Disadvantage? Optimization of the eigenfunctions may be more difficult?
Recovering the Density from the Eigenfunctions?
Visually the eigenfunctions appear to capture the main characteristics of
the density.
Can we obtain a better estimate of the density using the principal
eigenfunctions?
- (Girolami, 2001): truncating the orthogonal-series expansion of $p(x)$ in the
kernel eigenfunctions.
- Use ideas similar to (Teh & Roweis, 2003) and other mixtures of factor
analyzers: project back into input space, convolving with a
model of the reconstruction error as noise.
Role of Kernel Normalization?
Subtractive normalization leads to kernel PCA: center the feature map,
$\tilde\phi(x) = \phi(x) - E_x[\phi(x)]$.
Thus the corresponding kernel $\tilde K(x,y) = \tilde\phi(x) \cdot \tilde\phi(y)$ expands to:
$\tilde K(x,y) = K(x,y) - E_x[K(x,y)] - E_y[K(x,y)] + E_x[E_y[K(x,y)]]$
- the constant function is an eigenfunction
- eigenfunctions have zero mean and unit variance
- the double-centering normalization (MDS, Isomap) is as above
(based on the relation between dot products and distances)
- What can be said about the divisive normalization? It seems better at
clustering:
$\tilde K(x,y) = \frac{K(x,y)}{\sqrt{E_x[K(x,y)]\, E_y[K(x,y)]}}$
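The two normalizations have simple finite-sample signatures, sketched below: after double-centering, the constant vector is an eigenvector (with eigenvalue 0), while after divisive normalization $\sqrt{d}$ is an eigenvector with eigenvalue 1 (Gaussian kernel assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
M = np.exp(-sq / 2)
n = len(X)
one = np.full((n, n), 1.0 / n)

# subtractive (double-centering) normalization: the constant vector
# becomes an eigenvector with eigenvalue 0
Ms = M - one @ M - M @ one + one @ M @ one
print(np.allclose(Ms @ np.ones(n), 0))           # True

# divisive normalization: sqrt(d) is an eigenvector with eigenvalue 1
d = M.sum(axis=1)
Md = M / np.sqrt(np.outer(d, d))
print(np.allclose(Md @ np.sqrt(d), np.sqrt(d)))  # True
```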
Multi-layer Learning of Similarity and Density?
The learned eigenfunctions capture salient features of the distribution:
abstractions such as clusters and manifolds.
Old AI (and connectionist) idea: build high-level abstractions on top of
lower-level abstractions.
[Diagram: local Euclidean similarity -> farther-reaching notion of similarity;
empirical density -> improved density model]
Density-Adjusted Similarity and Kernel
[Figure: three points A, B, C]
Want A and B "closer" than B and C.
Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric,
with a metric tensor that penalizes low density.
SEE OTHER POSTER (Vincent & Bengio)
Density-Adjusted Similarity and Kernel
[Figures: original spirals data; Gaussian kernel spectral embedding;
two density-adjusted embeddings]
Conclusions
- Many unsupervised learning algorithms (kernel PCA, spectral
clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they
compute eigenfunctions of a normalized kernel.
- The embedding can be generalized to a mapping applicable to new points.
- The eigenfunctions seem to capture salient features of the distribution by
minimizing the kernel reconstruction error.
- Many questions remain open:
  - eigenfunctions -> recovering an explicit density function?
  - finding eigenfunctions with a smooth $p(x)$?
  - meaning of the various kernel normalizations?
  - multi-layer learning?
  - density-adjusted similarity (see the Vincent & Bengio poster).
Proposition 3
The principal eigenfunction of the linear operator
$(K_p f)(x) = \int K(x,y)\, f(y)\, p(y)\, dy$
corresponding to kernel $K$ is the (or a, if there are repeated eigenvalues) norm-1
function $f$ that minimizes the reconstruction error
$J(f, \lambda) = E_{x,y}\left[ \left( K(x,y) - \lambda f(x) f(y) \right)^2 \right]$
Proof of Proposition 1
Proposition 1: If we choose for $p(x)$ the empirical distribution of the
data, then the spectral embedding from $\tilde M$ is equivalent to values of the
eigenfunctions of the normalized kernel $\tilde K$: $f_k(x_i) = \sqrt{n}\, v_{ik}$.
(Simplified) proof:
As shown in Proposition 3, finding the function $f$ and scalar $\lambda$ minimizing
$E_{x,y}\left[ \left( \tilde K(x,y) - \lambda f(x) f(y) \right)^2 \right]$
s.t. $\|f\| = 1$ yields a solution that satisfies
$(\tilde K_p f)(x) = \lambda f(x)$
with $\lambda$ the (possibly repeated) maximum-norm eigenvalue.
Proof of Proposition 1
With the empirical $p(x)$, the above becomes (at $x = x_i$):
$\frac{1}{n} \sum_j \tilde K(x_i, x_j)\, f(x_j) = \lambda f(x_i)$
Write $u_i = f(x_i)$ and $\tilde M_{ij} = \tilde K(x_i, x_j)$; then
$\tilde M u = n \lambda\, u$
and, since $\|f\| = 1$ means $\frac{1}{n} \sum_i f(x_i)^2 = 1$, i.e. $\|u\| = \sqrt{n}$, we obtain
for the principal eigenvector: $v_1 = u / \sqrt{n}$, i.e. $f_1(x_i) = \sqrt{n}\, v_{i1}$.
For the other eigenvalues, consider the "residual kernel"
$\tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y)$ and recursively apply the
same reasoning to obtain $f_2(x_i) = \sqrt{n}\, v_{i2}$, $f_3(x_i) = \sqrt{n}\, v_{i3}$, etc.
Q.E.D.
Proof of Proposition 2
Proposition 2: If we choose for $p(x)$ the empirical distribution of the
data, then the kernel PCA projection is equivalent to scaled values of the
eigenfunctions of $\tilde K$: $P_k(x_i) = \sqrt{\lambda_k}\, f_k(x_i)$.
(Simplified) proof:
Kernel PCA diagonalizes the feature-space covariance matrix
$C = \frac{1}{n} \sum_i \tilde\phi(x_i)\, \tilde\phi(x_i)'$
whose eigenvectors can be written $w_k = \sum_i \alpha_{ki}\, \tilde\phi(x_i)$, where
$\alpha_k = v_k / \sqrt{\ell_k}$ and $v_k$ is also the $k$-th eigenvector of the Gram matrix
$\tilde M$ (with eigenvalue $\ell_k$).
The projection of $x$ on the $k$-th principal component is therefore
$P_k(x) = w_k \cdot \tilde\phi(x) = \sum_i \alpha_{ki}\, \tilde K(x_i, x)
= \frac{1}{\sqrt{\ell_k}} \sum_i v_{ik}\, \tilde K(x_i, x).$
By Proposition 1, the eigenfunctions of $\tilde K$ with the empirical $p(x)$ satisfy
$f_k(x) = \frac{\sqrt{n}}{\ell_k} \sum_i v_{ik}\, \tilde K(x_i, x),
\qquad \lambda_k = \frac{\ell_k}{n},$
so that, where $\tilde K$ takes its values,
$P_k(x) = \frac{\ell_k}{\sqrt{\ell_k}\, \sqrt{n}}\, f_k(x) = \sqrt{\lambda_k}\, f_k(x).$
Q.E.D.
Proof of Proposition 3
Proposition 3: Given the first $m-1$ eigenfunctions $f_k$ of a symmetric
function $\tilde K(x,y)$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the
expected value of
$\left( \tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y) \right)^2$
over unit-norm $f$. Then we get $f = \pm f_m$ and $\lambda = \lambda_m$.
Proof:
The reconstruction error using the approximation
$\tilde K(x,y) \approx \sum_{k<m} \lambda_k f_k(x) f_k(y) + \lambda f(x) f(y)$ is
$J(f, \lambda) = E_{x,y}\left[ \left( R(x,y) - \lambda f(x) f(y) \right)^2 \right],
\qquad R(x,y) = \tilde K(x,y) - \sum_{k<m} \lambda_k f_k(x) f_k(y),$
where the $(f_k, \lambda_k)$ are the first $m-1$ (eigenfunction, eigenvalue) pairs in
order of decreasing absolute value of $\lambda_k$.
Minimization of $J$ w.r.t. $\lambda$ gives (using $\|f\| = 1$)
$\lambda = E_{x,y}\left[ R(x,y)\, f(x)\, f(y) \right] \qquad (1)$
and, substituting back,
$J = E_{x,y}\left[ R(x,y)^2 \right] - \lambda^2$
using eq. (1). So $\lambda^2$ should be maximized.
Setting the functional derivative of $\lambda^2$ w.r.t. $f$, under the constraint
$\|f\| = 1$, equal to zero yields
$(R_p f)(x) = \int R(x,y)\, f(y)\, p(y)\, dy = \mu f(x) \qquad (2)$
for a Lagrange multiplier $\mu$: $f$ must be an eigenfunction of the residual
operator.
Expand $f$ in the eigenfunctions of $\tilde K$: $f = \sum_j w_j f_j$ with
$\sum_j w_j^2 = 1$ (by Parseval's theorem). Using the orthogonality of the $f_k$
($k < m$), the residual operator removes the first $m-1$ components:
$(R_p f)(x) = \sum_{j \ge m} \lambda_j w_j f_j(x),$
so eq. (1) gives $\lambda = \sum_{j \ge m} \lambda_j w_j^2$, hence $|\lambda| \le |\lambda_m|$.
If the $\lambda_j$ are distinct, $\lambda^2$ is maximized when $w_m^2 = 1$ and $w_j = 0$ for
$j \ne m$, i.e. $f = \pm f_m$ and $\lambda = \lambda_m$.
Q.E.D.