Manifold Learning: From Linear to Nonlinear
Wei-Lun (Harry) Chao
Date: April 7, 2011
1
Outline
• Notation and fundamentals of linear algebra
• PCA and LDA
• Topology, manifold, and embedding
• MDS
• ISOMAP
• LLE
• Laplacian eigenmap
2
Reference
• [1] J. A. Lee et al., Nonlinear Dimensionality Reduction, 2007
• [2] R. O. Duda et al., Pattern Classification, 2001
• [3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997
• [4] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, 2000
• [5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, 2000
• [6] L. K. Saul et al., Think globally, fit locally, 2003
• [7] M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, 2003
• [8] T. F. Cootes et al., Active appearance models, 1998
3
Notation
• Data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$, low-D: $Y = \{\mathbf{y}^{(n)} \in \mathbb{R}^p\}_{n=1}^N$
• Matrix: $A_{n\times m} = [\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \ldots, \mathbf{a}^{(m)}] = [a_{ij}]_{1\le i\le n,\,1\le j\le m}$
• Vector: $\mathbf{a}^{(i)} = [a_1^{(i)}, a_2^{(i)}, \ldots, a_n^{(i)}]^T$
• Matrix form of data set: $X_{d\times N} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}]$
4
Fundamentals of Linear Algebra
• SVD (singular value decomposition):
  $X_{d\times N} = V_{d\times d}\,\Sigma_{d\times N}\,U_{N\times N}^T$, where
  $V = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(d)}]$, $U = [\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(N)}]$,
  $\Sigma_{d\times N} = [\,\mathrm{diag}(\sigma_{11}, \sigma_{22}, \ldots, \sigma_{dd}) \mid \mathbf{0}\,]$ with $\sigma_{11} \ge \sigma_{22} \ge \cdots \ge \sigma_{dd} \ge 0$,
  $V^TV = VV^T = I_d$ and $U^TU = UU^T = I_N$
5
Fundamentals of Linear Algebra
• EVD (eigenvector decomposition):
  $A_{N\times N}V = V\Lambda$, i.e. $A[\mathbf{v}^{(1)}, \ldots, \mathbf{v}^{(N)}] = [\mathbf{v}^{(1)}, \ldots, \mathbf{v}^{(N)}]\,\mathrm{diag}(\lambda_1, \ldots, \lambda_N)$
• SVD vs. EVD (symmetric positive semi-definite):
  $A = XX^T = V\Sigma U^T U\Sigma^T V^T = V(\Sigma\Sigma^T)V^T$, so $AV = V\Lambda$ with $\Lambda = \Sigma\Sigma^T$;
  the left singular vectors of $X$ are the eigenvectors of $A = XX^T$ (see the code sketch after this slide)
• Caution: Eigenvectors are not always orthogonal (they are for symmetric $A$)!
6
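The SVD/EVD relation above is easy to verify numerically. Below is a minimal sketch (assuming numpy and a random data matrix; the variable names are illustrative, not part of the original slides):

```python
# Check that for A = X X^T the eigenvectors are the left singular vectors of X
# and the eigenvalues are the squared singular values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))                        # d x N data matrix

V, sigma, Ut = np.linalg.svd(X, full_matrices=False)    # X = V diag(sigma) Ut
A = X @ X.T                                             # symmetric PSD, d x d
eigvals, eigvecs = np.linalg.eigh(A)                    # ascending order

print(np.allclose(np.sort(eigvals)[::-1], sigma**2))    # True
print(np.allclose(np.abs(eigvecs[:, ::-1]), np.abs(V))) # True (up to sign)
```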
Fundamentals of Linear Algebra
• Determinant: $\det(A_{N\times N}) = \prod_{n=1}^{N}\lambda_n$
• Trace: $\mathrm{tr}(A_{N\times N}) = \sum_{n=1}^{N}a_{nn} = \sum_{n=1}^{N}\lambda_n$, and $\mathrm{tr}(A_{d\times N}B_{N\times d}) = \mathrm{tr}(B_{N\times d}A_{d\times N})$
• Rank: $\mathrm{rank}(A) = \mathrm{rank}(V\Sigma U^T)$ = # nonzero diagonal elements of $\Sigma$ = # independent columns of $A$ = # nonzero eigenvalues (square $A$)
7
Dimensionality reduction
• Operation: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N \;\rightarrow\; Y = \{\mathbf{y}^{(n)} \in \mathbb{R}^p\}_{n=1}^N$, with $p \ll d$
• Reason:
Compression
Knowledge discovery or feature extraction
Removal of irrelevant and noisy features
Visualization
Curse of dimensionality
8
Dimensionality reduction
• Methods:
Feature transform: $f: \mathbb{R}^d \rightarrow \mathbb{R}^p$, $p \ll d$, $\mathbf{y} = f(\mathbf{x})$; linear form: $\mathbf{y} = W_{p\times d}\,\mathbf{x}$
Feature selection: $\mathbf{y} = [x_{s(1)}, x_{s(2)}, \ldots, x_{s(p)}]^T$, where $s$ denotes the selection indices
• Criterion:
Preserve some properties or structures of the high-D feature space in the low-D feature space
These properties are measured from the data
9
Principal Component Analysis (PCA)
10
[1] J. A. Lee et al., Nonlinear Dimensionality Reduction, 2007
[2] R. O. Duda et al., Pattern Classification, 2001
Principal component analysis (PCA)
• PCA is basic yet important and useful:
Easy to train and use
Lots of additional functionalities: noise reduction,
ellipse fitting, singular matrix inversion, ……
• Also called Karhunen-Loeve transform (KL transform)
• Criteria:
Maximum variance with decorrelation
Minimum reconstruction error
11
Principal component analysis (PCA)
• Surprising usage: face recognition and encoding
[Figure: a face image encoded as a weighted sum of eigenfaces, e.g. −2181·(eigenface 1) + 627·(eigenface 2) + 389·(eigenface 3) + …]
12
Principal component analysis (PCA)
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$
• Preprocessing: centering (mean can be added back): $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}$, $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or say $X \leftarrow X - \bar{\mathbf{x}}\mathbf{1}_N^T$
• Model: $\mathbf{y} = W^T\mathbf{x}$, where $W \in \mathbb{R}^{d\times p}$ ($d \ge p$) and $W$ is orthonormal ($W^TW = I_{p\times p}$); reconstruction: $\mathbf{z} = W\mathbf{y} = WW^T\mathbf{x}$
13
Maximum variance with decorrelation
• The low-D feature vectors should be decorrelated
• Covariance: $\mathrm{cov}(x_1, x_2) = E[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] = \frac{1}{N}\sum_{n=1}^{N}(x_1^{(n)}-\bar{x}_1)(x_2^{(n)}-\bar{x}_2)$
• Covariance matrix: $C_{xx} = [\mathrm{cov}(x_i, x_j)]_{d\times d} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}^{(n)}-\bar{\mathbf{x}})(\mathbf{x}^{(n)}-\bar{\mathbf{x}})^T = \frac{1}{N}XX^T$ (after centering)
14
Maximum variance with decorrelation
• Covariance matrix: $C_{xx} = [\mathrm{cov}(x_i, x_j)]_{d\times d}$, with $\mathrm{cov}(x_1, x_2) = E[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] = \frac{1}{N}\sum_{n=1}^{N}(x_1^{(n)}-\bar{x}_1)(x_2^{(n)}-\bar{x}_2)$
15
Maximum variance with decorrelation
• Decorrelation: $C_{yy} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{y}^{(n)}\mathbf{y}^{(n)T} = \frac{1}{N}\sum_{n=1}^{N}W^T\mathbf{x}^{(n)}\mathbf{x}^{(n)T}W = W^TC_{xx}W$ should be a diagonal matrix
16
Maximum variance with decorrelation
• Maximum variance:
  $W^* = \arg\max_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\|\mathbf{y}^{(n)}\|_2^2 = \arg\max_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\|W^T\mathbf{x}^{(n)}\|_2^2$
  $= \arg\max_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}(W^T\mathbf{x}^{(n)})^T(W^T\mathbf{x}^{(n)}) = \arg\max_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\mathrm{tr}\{W^T\mathbf{x}^{(n)}\mathbf{x}^{(n)T}W\}$
  $= \arg\max_{W^TW=I}\mathrm{tr}\big\{W^T\big(\tfrac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}\big)W\big\} = \arg\max_{W^TW=I}\mathrm{tr}\{W^TC_{xx}W\}$
  (each summand $\|W^T\mathbf{x}^{(n)}\|_2^2$ is a scalar, i.e. $1\times 1$; the trace argument is $p\times p$)
17
Maximum variance with decorrelation
• Optimization: $W^* = \arg\max_{W^TW=I}\mathrm{tr}\{W^TC_{xx}W\}$ subject to $C_{yy} = W^TC_{xx}W$ being diagonal
• Solution: $W^* = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$, where $\mathbf{v}^{(i)}$ is the $i$-th largest eigenvector of $C_{xx} \in \mathbb{R}^{d\times d}$
  $C_{xx} = \frac{1}{N}XX^T = \frac{1}{N}V\Sigma U^TU\Sigma^TV^T = \frac{1}{N}V\Sigma\Sigma^TV^T = V\Lambda V^T$, and $W^{*T}W^* = I_{p\times p}$
18
Maximum variance with decorrelation
• Proof (for $p = 1$): $\mathrm{tr}\{W^TC_{xx}W\} = \mathbf{w}^TC_{xx}\mathbf{w}$ with $\mathbf{w}^T\mathbf{w} = 1$
  Lagrangian: $E(\mathbf{w}, \lambda) = \mathbf{w}^TC_{xx}\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$
  Take the partial derivative: $\frac{\partial E}{\partial\mathbf{w}} = 2C_{xx}\mathbf{w} - 2\lambda\mathbf{w} = 0 \;\Rightarrow\; C_{xx}\mathbf{w} = \lambda\mathbf{w}$
  So $\mathbf{w}$ is an eigenvector of $C_{xx}$, and $\mathbf{w}^TC_{xx}\mathbf{w} = \lambda$ is maximized when $\mathbf{w}$ is the largest eigenvector $\mathbf{v}^{(1)}$ of $C_{xx} = V\Lambda V^T$
19
Minimum reconstruction error
• Mean square error is preferred:
  $W^* = \arg\min_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\|\mathbf{x}^{(n)} - \mathbf{z}^{(n)}\|_2^2 = \arg\min_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\|\mathbf{x}^{(n)} - WW^T\mathbf{x}^{(n)}\|_2^2$
  $= \arg\min_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\big((I-WW^T)\mathbf{x}^{(n)}\big)^T\big((I-WW^T)\mathbf{x}^{(n)}\big)$
  $= \arg\min_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)T}\big(I - 2WW^T + WW^TWW^T\big)\mathbf{x}^{(n)}$
  $= \arg\min_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)T}\mathbf{x}^{(n)} - \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)T}WW^T\mathbf{x}^{(n)}$
  $= \arg\max_{W^TW=I}\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)T}WW^T\mathbf{x}^{(n)} = \arg\max_{W^TW=I}\mathrm{tr}\{W^TC_{xx}W\}$
20
Algorithm
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$
• Preprocessing: centering (mean can be added back): $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or say $X \leftarrow X - \bar{\mathbf{x}}\mathbf{1}_N^T$
• Model: $\mathbf{y} = W^T\mathbf{x}$, where $W \in \mathbb{R}^{d\times p}$ ($d \ge p$) and $W$ is orthonormal ($W^TW = I_{p\times p}$); reconstruction: $\mathbf{z} = W\mathbf{y} = WW^T\mathbf{x}$
21
Algorithm
• Algorithm 1: (EVD)
  1. $C_{xx} = V\Lambda V^T$, where the $\lambda_i$ in $\Lambda$ are in descending order
  2. $W = V\,I_{d\times p} = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$ (the first $p$ columns of $V$), with $I_{d\times p} = \begin{bmatrix} I_{p\times p} \\ O \end{bmatrix}$
• Algorithm 2: (SVD)
  1. $X = V\Sigma U^T$, where the $\sigma_i$ in $\Sigma$ are in descending order
  2. $W = V\,I_{d\times p} = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$
  (a code sketch of Algorithm 1 follows this slide)
22
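As a companion to Algorithm 1, here is a minimal PCA sketch in Python/numpy. The function name and the random test data are illustrative assumptions, not the author's original code:

```python
import numpy as np

def pca(X, p):
    """X: d x N data matrix (columns are samples); returns W (d x p), Y (p x N), mean."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar                          # centering (mean can be added back)
    C = Xc @ Xc.T / Xc.shape[1]             # covariance matrix C_xx (d x d)
    eigvals, V = np.linalg.eigh(C)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # re-sort in descending order
    W = V[:, order[:p]]                     # top-p eigenvectors
    Y = W.T @ Xc                            # low-D coordinates y = W^T x
    return W, Y, x_bar

# Usage: reconstruct z = W y + mean
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 200))
W, Y, x_bar = pca(X, p=2)
Z = W @ Y + x_bar                           # reconstruction W W^T (x - mean) + mean
```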
Illustration
• What is PCA doing:
23
Summary
• PCA exploits 2nd-order statistical properties measured from the data (simple and hard to over-fit)
• Usually used as a "preprocessing step" in applications
• Rank: $C_{xx} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T} = \frac{1}{N}XX^T$, $V^TC_{xx}V = \Lambda$; after centering, $\mathrm{rank}(C_{xx}) = N-1$ in general, so $p \le N-1$
24
Examples
• Active appearance model: $\mathbf{y} = W^T\mathbf{x}$, $\mathbf{z} = W\mathbf{y} = WW^T\mathbf{x}$
25
[8]
Linear Discriminant Analysis (LDA)
26
[2] R. O. Duda et al., Pattern Classification, 2001
[3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997
Linear discriminant analysis (LDA)
• PCA is unsupervised
• LDA takes the label information into consideration
• Achieved low-D features are efficient for discrimination
27
Linear discriminant analysis (LDA)
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$, with $label(\mathbf{x}^{(n)}) \in L = \{l_1, l_2, \ldots, l_c\}$
• Model: $\mathbf{y} = W^T\mathbf{x}$, where $W \in \mathbb{R}^{d\times p}$ ($d \ge p$)
• Notation:
  $X_i = \{\mathbf{x}^{(n)} : label(\mathbf{x}^{(n)}) = l_i\}$, $N_i$ = # samples in class $i$
  class mean: $\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}^{(n)}\in X_i}\mathbf{x}^{(n)}$; total mean: $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}$
  between-class scatter: $S_B = \sum_{i=1}^{c}N_i(\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^T$
  within-class scatter: $S_W = \sum_{i=1}^{c}\sum_{\mathbf{x}^{(n)}\in X_i}(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)^T$
28
Linear discriminant analysis (LDA)
• Properties of the scatter matrices:
  $S_B = \sum_{i=1}^{c}N_i(\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^T$ measures inter-class separation
  $S_W = \sum_{i=1}^{c}\sum_{\mathbf{x}^{(n)}\in X_i}(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)^T$ measures intra-class tightness
• Scatter matrices in low-D:
  between-class: $W^TS_BW = \sum_{i=1}^{c}N_i(W^T\boldsymbol{\mu}_i - W^T\boldsymbol{\mu})(W^T\boldsymbol{\mu}_i - W^T\boldsymbol{\mu})^T$
  within-class: $W^TS_WW = \sum_{i=1}^{c}\sum_{\mathbf{x}^{(n)}\in X_i}(W^T\mathbf{x}^{(n)} - W^T\boldsymbol{\mu}_i)(W^T\mathbf{x}^{(n)} - W^T\boldsymbol{\mu}_i)^T$
29
Linear discriminant analysis (LDA)
30
Criterion and algorithm
• Criterion of LDA:
  Maximize the ratio of $W^TS_BW$ to $W^TS_WW$ "in some sense"
• Determinant and trace are suitable scalar measures:
  $W^* = \arg\max_W \dfrac{|W^TS_BW|}{|W^TS_WW|}$
• With the Rayleigh quotient:
  $S_B$ and $S_W$ are both symmetric positive semi-definite, and $S_W$ is nonsingular
  $W^* = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(p)}]$, where $S_B\mathbf{v}^{(i)} = \lambda_i S_W\mathbf{v}^{(i)}$ and the $\lambda_i$ are in descending order
31
Note and Problem
• Note:
  $S_B\mathbf{v}^{(i)} = \lambda_i S_W\mathbf{v}^{(i)} \;\Rightarrow\; S_W^{-1}S_B\mathbf{v}^{(i)} = \lambda_i\mathbf{v}^{(i)}$
  $\mathrm{rank}(S_B) \le c-1$, so there are at most $c-1$ nonzero eigenvalues, and therefore $p \le c-1$
• Problem:
  $\mathrm{rank}(S_W) \le N-c$, and $S_W$ is $d\times d$
  if $\mathrm{rank}(S_W) < d$, $S_W$ is singular and the Rayleigh quotient is useless
32
Solution
• Problem: $S_W = \sum_{i=1}^{c}\sum_{\mathbf{x}^{(n)}\in X_i}(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)(\mathbf{x}^{(n)} - \boldsymbol{\mu}_i)^T$ is singular
• Solution:
  PCA+LDA:
  1. Perform PCA on $\mathbf{x}^{(n)}$: $\tilde{\mathbf{x}}^{(n)} = W_{PCA}^T\mathbf{x}^{(n)} \in \mathbb{R}^{N-c}$ ($W_{PCA}$ is $d\times(N-c)$)
  2. Compute $S_W$ (now $(N-c)\times(N-c)$); if it is nonsingular, the problem is solved
  3. For new samples, $\mathbf{y} = W_{LDA}^TW_{PCA}^T\mathbf{x}$
  Null-space:
  1. $W^* = \arg\max_W W^TS_BW$, restricted to $W$ that make $W^TS_WW = 0$
  2. i.e. extract the columns of $W^*$ from the null space of $S_W$
  (a code sketch of plain LDA follows this slide)
33
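A minimal LDA sketch under the assumption that S_W is nonsingular (for example after the PCA+LDA step above). It uses numpy only; the function name and data handling are illustrative, not the author's implementation:

```python
import numpy as np

def lda(X, labels, p):
    """X: d x N data matrix, labels: length-N array; returns W (d x p)."""
    classes = np.unique(labels)
    d = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)                  # total mean
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)           # class mean
        S_B += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
        S_W += (Xc - mu_c) @ (Xc - mu_c).T
    # Generalized eigenproblem S_B v = lambda S_W v (S_W assumed invertible)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:p]].real                   # at most c-1 useful columns

# Usage: y = W.T @ x, with p <= c - 1
```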
Example
34
[3]
Topology, Manifold, and Embedding
35
[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007
Topology
• Geometrical point of view
If two or more features are dependent, their joint distribution does not span the whole feature space
The dependence induces some structure (object) in the feature space
Ex: $(x_1, x_2) = g(s)$, $a \le s \le b$, traces a curve from $g(a)$ to $g(b)$
36
Topology
• Topology:
Allowed: deformation, twisting, and stretching
Not allowed: tearing
A topological object means properties and structures
A topological object is represented (embedded) as a spatial object in the feature space
Topology abstracts the intrinsic structure, but ignores the details of the spatial object
Ex: a circle and an ellipse are topologically homeomorphic
37
Manifold
• Geometrical space: dimensionality + structure
• Neighborhood: $Nei(\mathbf{x}^{(i)})$ = ball $B_\varepsilon(\mathbf{x}^{(i)}) = \{\mathbf{x} \in \mathbb{R}^d : \|\mathbf{x} - \mathbf{x}^{(i)}\|_{L_2} < \varepsilon\}$
• A topological space can be characterized by neighborhoods
• A manifold is a locally Euclidean topological space
• Euclidean space: $dis(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) = \|\mathbf{x}^{(1)} - \mathbf{x}^{(2)}\|_{L_2}$ is meaningful
• In general, any spatial object that is nearly flat on a small scale is a manifold
38
Manifold
[Figure: a 3D + non-Euclidean spatial object vs. its 2D + Euclidean re-embedding]
39
[5]
Embedding
• Embedding:
Embedding is a representation of a topological object (e.g. a manifold or a graph) in a certain space, in such a way that the topological properties are preserved
A smooth manifold is differentiable and has a "functional structure" linking the features with the latent variables.
The dimensionality of a manifold is the # latent variables
A k-manifold can be embedded into any d-dimensional space with d equal to or larger than 2k+1
40
Manifold learning
• Manifold learning:
Re-embed a k-manifold embedded in a d-dimensional space into a p-dimensional space with p < d
In practice, the structure of the manifold can only be measured from the underlying data.
• Importance:
The space properties of the original high-D space are measured from data, while the space properties of the target low-D space can be defined by the user!
41
Example
• A curve in 3D: $(x_1, x_2, x_3) = g_1(s)$, $a \le s \le b$, with endpoints $g_1(a)$ and $g_1(b)$
• Latent variable: $s \in [a, b]$
• Re-embedding: $f: g_1(s) \rightarrow g_2(s)$, giving a curve in 2D: $(x_1, x_2) = g_2(s)$, $a \le s \le b$, with endpoints $g_2(a)$ and $g_2(b)$
42
Multidimensional Scaling (MDS)
43
[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007
Multidimensional Scaling (MDS)
• Distance preserving: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \approx dis(\mathbf{y}^{(i)}, \mathbf{y}^{(j)})$
• Scaling: methods that construct a configuration of samples in a target metric space from information about the inter-point distances
• MDS is a scaling whose target space is Euclidean
• Here we focus on classical metric MDS
• Metric MDS indeed preserves pairwise inner products rather than pairwise distances
• Metric MDS is unsupervised
44
Multidimensional Scaling (MDS)
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$
• Preprocessing: centering (mean can be added back): $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or say $X \leftarrow X - \bar{\mathbf{x}}\mathbf{1}_N^T$
• Model: $f: \mathbb{R}^d \rightarrow \mathbb{R}^p$, $p \ll d$, $\mathbf{x} \mapsto \mathbf{y}$; there is no explicit $W$ to train
45
Criterion
• Inner product (scalar product): $s_X(i, j) = \langle\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\rangle = \mathbf{x}^{(i)T}\mathbf{x}^{(j)}$
• Gram matrix: records the pairwise inner products, $S = [s_X(i, j)]_{1\le i,j\le N} = X^TX$
• Usually we only know $S$ (or the pairwise distances), but not $X$
46
Criterion
• Criterion 1: $Y^* = \arg\min_Y \sum_{i,j=1}^{N}\big(s_X(i,j) - \mathbf{y}^{(i)T}\mathbf{y}^{(j)}\big)^2 = \arg\min_Y \|S - Y^TY\|_F^2$,
  where $\|\cdot\|_F$ is the matrix norm, also called the Frobenius norm: $\|A\|_F^2 = \sum_{i,j}a_{ij}^2 = \mathrm{tr}(A^TA)$
• Criterion 2: $X^TX = S = Y^TY$
47
Algorithm
• EVD:
  1. $S = X^TX = U\Lambda U^T = (\Lambda^{1/2}U^T)^T(\Lambda^{1/2}U^T) = Y^TY$, with $\mathrm{rank}(S) \le d$
  2. $Y = R\,I_{p\times N}\Lambda^{1/2}U^T$ (the top $p$ rows of $\Lambda^{1/2}U^T$), where $R$ is an arbitrary $p\times p$ orthonormal matrix for rotation
  (a code sketch follows this slide)
48
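The EVD step above can be sketched in a few lines of numpy. The helper name and the random test data are assumptions for illustration only:

```python
import numpy as np

def mds_from_gram(S, p):
    """S: N x N Gram matrix of centered data; returns Y (p x N) with Y^T Y ~ S."""
    eigvals, U = np.linalg.eigh(S)               # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:p]        # keep the p largest
    lam = np.clip(eigvals[order], 0.0, None)     # guard small negatives from round-off
    return np.sqrt(lam)[:, None] * U[:, order].T # Y = Lambda^{1/2} U^T (top p rows)

# Usage:
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))
Xc = X - X.mean(axis=1, keepdims=True)
Y = mds_from_gram(Xc.T @ Xc, p=2)                # Y^T Y approximates the Gram matrix
```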
PCA vs. MDS
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$, SVD: $X = V\Sigma U^T$
• PCA: EVD on the covariance matrix
  $C_{xx} = \frac{1}{N}XX^T = \frac{1}{N}V\Sigma U^TU\Sigma^TV^T = \frac{1}{N}V\Sigma\Sigma^TV^T$
  $Y_{PCA} = W_{PCA}^TX = I_{p\times d}V^TV\Sigma U^T = I_{p\times d}\Sigma U^T$
• MDS: EVD on the Gram matrix
  $S = X^TX = U\Sigma^TV^TV\Sigma U^T = U\Sigma^T\Sigma U^T = U\Lambda U^T$
  $Y_{MDS} = I_{p\times N}\Lambda^{1/2}U^T$
49
PCA vs. MDS
• Discard the rotation term; with some derivations:
  $Y_{MDS} = I_{p\times N}\Lambda^{1/2}U^T = I_{p\times N}(\Sigma^T\Sigma)^{1/2}U^T$
  $Y_{PCA} = I_{p\times d}V^TX = I_{p\times d}V^TV\Sigma U^T = I_{p\times d}\Sigma U^T$
  so $Y_{MDS} = Y_{PCA}$ (up to sign; see the check after this slide)
• Comparison:
  PCA: EVD on the $d\times d$ matrix $C_{xx} \propto XX^T$
  MDS: EVD on the $N\times N$ matrix $S = X^TX$
  SVD: SVD on the $d\times N$ matrix $X$
50
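A quick numerical sanity check (numpy, random data; not part of the slides) that the two embeddings agree up to per-row sign:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))
Xc = X - X.mean(axis=1, keepdims=True)
N, p = Xc.shape[1], 2

# PCA: EVD of C_xx = X X^T / N, then Y_pca = W^T X
eigvals_c, V = np.linalg.eigh(Xc @ Xc.T / N)
W = V[:, np.argsort(eigvals_c)[::-1][:p]]
Y_pca = W.T @ Xc

# MDS: EVD of S = X^T X, then Y_mds = Lambda^{1/2} U^T
eigvals_s, U = np.linalg.eigh(Xc.T @ Xc)
idx = np.argsort(eigvals_s)[::-1][:p]
Y_mds = np.sqrt(eigvals_s[idx])[:, None] * U[:, idx].T

print(np.allclose(np.abs(Y_pca), np.abs(Y_mds)))   # True (rows agree up to sign)
```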
For test data
• Model: use $W = VI_{d\times p}$ from PCA for convenience, with the generative view $\mathbf{x} \approx W\mathbf{y}$
• For a new coming test $\mathbf{x}$: $\mathbf{y} = W^T\mathbf{x} = I_{p\times d}V^T\mathbf{x}$
• Finally: with $X = V\Sigma U^T$, the training embedding $W^TX = I_{p\times d}\Sigma U^T$ coincides (up to sign) with the MDS result $I_{p\times N}\Lambda^{1/2}U^T$, so the same projection embeds test samples consistently
51
MDS with pairwise distance
• How about a training set given only pairwise distances?
  $D = [d_{ij}]_{1\le i,j\le N} = [dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})]$, with no $X$ and no $S$ available
52
Distance metric
• Distance metric:
  Nonnegative: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \ge 0$, and $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = 0$ iff $\mathbf{x}^{(i)} = \mathbf{x}^{(j)}$
  Symmetric: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = dis(\mathbf{x}^{(j)}, \mathbf{x}^{(i)})$
  Triangular: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \le dis(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) + dis(\mathbf{x}^{(k)}, \mathbf{x}^{(j)})$
• Minkowski distance (order $p$): $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \big[\sum_{k=1}^{d}|x_k^{(i)} - x_k^{(j)}|^p\big]^{1/p}$
53
Distance metric
• (Training) data set: $D = [d_{ij}]_{1\le i,j\le N} = [dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})]$, with no $X$ and no $S$
• Euclidean distance and inner product:
  $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_{L_2} = \big[\sum_{k=1}^{d}(x_k^{(i)} - x_k^{(j)})^2\big]^{1/2}$
  $dis^2(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = (\mathbf{x}^{(i)} - \mathbf{x}^{(j)})^T(\mathbf{x}^{(i)} - \mathbf{x}^{(j)}) = s_X(i,i) - 2s_X(i,j) + s_X(j,j)$
  $\Rightarrow s_X(i,j) = -\frac{1}{2}\{dis^2(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) - s_X(i,i) - s_X(j,j)\}$
54
Distance to inner product
• Define the squared-distance matrix: $D_2 = [d_{ij}^2]_{1\le i,j\le N}$, where $d_{ij} = dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
• Double centering (a code sketch follows this slide):
  $S = -\frac{1}{2}\big(D_2 - \frac{1}{N}D_2\mathbf{1}\mathbf{1}^T - \frac{1}{N}\mathbf{1}\mathbf{1}^TD_2 + \frac{1}{N^2}\mathbf{1}\mathbf{1}^TD_2\mathbf{1}\mathbf{1}^T\big)$
  $s_X(i,j) = -\frac{1}{2}\big(d_{ij}^2 - \frac{1}{N}\sum_k d_{ik}^2 - \frac{1}{N}\sum_m d_{mj}^2 + \frac{1}{N^2}\sum_{m,k} d_{mk}^2\big)$
55
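A minimal double-centering sketch in numpy, checked against X^T X for centered data; the helper name and test data are assumptions for illustration:

```python
import numpy as np

def double_centering(D2):
    """D2: N x N matrix of squared pairwise distances; returns S = -1/2 H D2 H."""
    N = D2.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    return -0.5 * H @ D2 @ H

# Check on centered data: recovered S equals the Gram matrix X^T X
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 30))
Xc = X - X.mean(axis=1, keepdims=True)
sq = (Xc**2).sum(axis=0)
D2 = sq[:, None] + sq[None, :] - 2 * Xc.T @ Xc
print(np.allclose(double_centering(D2), Xc.T @ Xc))   # True
```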
Proof
• Proof (data are centered, so $\frac{1}{N}\sum_m\mathbf{x}^{(m)} = \mathbf{0}$):
  $\frac{1}{N}\sum_{m=1}^{N}d_{mj}^2 = \frac{1}{N}\sum_{m}\big[s_X(m,m) - 2s_X(m,j) + s_X(j,j)\big]$
  $= \frac{1}{N}\sum_{m}\langle\mathbf{x}^{(m)},\mathbf{x}^{(m)}\rangle - \frac{2}{N}\big\langle\sum_m\mathbf{x}^{(m)},\mathbf{x}^{(j)}\big\rangle + \langle\mathbf{x}^{(j)},\mathbf{x}^{(j)}\rangle = \frac{1}{N}\sum_{m}\langle\mathbf{x}^{(m)},\mathbf{x}^{(m)}\rangle + \langle\mathbf{x}^{(j)},\mathbf{x}^{(j)}\rangle$
  Similarly, $\frac{1}{N}\sum_{k=1}^{N}d_{ik}^2 = \frac{1}{N}\sum_{k}\langle\mathbf{x}^{(k)},\mathbf{x}^{(k)}\rangle + \langle\mathbf{x}^{(i)},\mathbf{x}^{(i)}\rangle$
56
Proof
• Proof (continued):
  $\frac{1}{N^2}\sum_{m=1}^{N}\sum_{k=1}^{N}d_{mk}^2 = \frac{1}{N}\sum_{m}\langle\mathbf{x}^{(m)},\mathbf{x}^{(m)}\rangle + \frac{1}{N}\sum_{k}\langle\mathbf{x}^{(k)},\mathbf{x}^{(k)}\rangle - \frac{2}{N^2}\big\langle\sum_m\mathbf{x}^{(m)},\sum_k\mathbf{x}^{(k)}\big\rangle = \frac{2}{N}\sum_{m}\langle\mathbf{x}^{(m)},\mathbf{x}^{(m)}\rangle$
• Finally:
  $d_{ij}^2 - \frac{1}{N}\sum_{k}d_{ik}^2 - \frac{1}{N}\sum_{m}d_{mj}^2 + \frac{1}{N^2}\sum_{m,k}d_{mk}^2 = -2\langle\mathbf{x}^{(i)},\mathbf{x}^{(j)}\rangle = -2\,s_X(i,j)$
57
Algorithm
• Given X:
  Compute the Gram matrix S, perform MDS
• Given S:
  Perform MDS
• Given D:
  Square each entry of D
  Perform double centering
  Perform MDS
58
Summary
• Metric MDS preserves pairwise inner product instead of pairwise distance
• It preserves linear properties
• Extension:
Sammon's nonlinear mapping: $E_{NLM} = \sum_{i=1}^{N}\sum_{j=1}^{N}\dfrac{(dis_X(i,j) - dis_Y(i,j))^2}{dis_X(i,j)}$
Curvilinear component analysis (CCA): $E_{CCA} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(dis_X(i,j) - dis_Y(i,j))^2\, h(dis_Y(i,j))$
59
From Linear To Nonlinear
60
Linear
• PCA, LDA, MDS are linear:
Matrix operation
Linear properties (sum, scaling, commutative,…)
• Inner product and covariance are linear operations:
  $\langle\mathbf{x}^{(k)}, \mathbf{x}^{(i)} + \mathbf{x}^{(j)}\rangle = (\mathbf{x}^{(i)} + \mathbf{x}^{(j)})^T\mathbf{x}^{(k)} = \mathbf{x}^{(i)T}\mathbf{x}^{(k)} + \mathbf{x}^{(j)T}\mathbf{x}^{(k)}$
• Assumption on the original feature space:
  Euclidean, or Euclidean with "rotation and scaling"
61
Problem
• If there exists structure in the feature space:
  $(x_1, x_2, x_3) = g_1(s)$, $a \le s \le b$: a curve from $g_1(a)$ to $g_1(b)$; linear methods crash on such a structure
62
Manifold way
• Assumption:
The latent space is nonlinearly embedded in the feature space
The latent space is a manifold, and so is the feature space
The feature space is locally smooth and Euclidean
• Local geometry or property:
Distance preserving (ISOMAP)
Neighborhood (topology) preserving (LLE)
• Caution:
These properties and structures are measured in the feature space
63
Isometric Feature Mapping (ISOMAP)
64
[4] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, 2000
ISOMAP
• Distance metric in feature space: Geodesic distance
• How to measure:
Small scale: Euclidean distance in $\mathbb{R}^d$
Large scale: shortest path in a graph
• The space to re-embed:
p-dimensional Euclidean space
After we get the pairwise distances, we can embed them in many kinds of spaces
65
Graph
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$
  The samples are the vertices of a graph (assume they are placed in order, from $\mathbf{x}^{(1)}$ to $\mathbf{x}^{(N)}$)
66
Small scale
• Small scale: Euclidean distance; large scale: graph distance
  Vertices + edges (assume the samples are placed in order):
  $dis(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \approx \sum_{i=1}^{N-1}\|\mathbf{x}^{(i)} - \mathbf{x}^{(i+1)}\|_{L_2}$
67
Distance metric
• MDS: vertices + edges (assume the samples are placed in order)
  $dis(\mathbf{y}^{(1)}, \mathbf{y}^{(N)}) = \|\mathbf{y}^{(1)} - \mathbf{y}^{(N)}\|_{L_2} \approx dis(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \approx \sum_{i=1}^{N-1}\|\mathbf{x}^{(i)} - \mathbf{x}^{(i+1)}\|_{L_2}$
68
Algorithm
• Presetting:
  Define the distance matrix $D_{N\times N} = [d_{ij}]_{1\le i,j\le N}$
  Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$ (defined on the next slide)
• (1) Geodesic distance within a neighborhood:
  for i = 1:N
    for j = 1:N
      if ($\mathbf{x}^{(j)} \in Nei(i)$ and $i \ne j$): $d_{ij} = \|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_{L_2}$ (otherwise $d_{ij} = \infty$)
    end
  end
69
Algorithm
• (1) Geodesic distance within a neighborhood:
  Neighbor definitions:
  $\varepsilon$-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\|\mathbf{x}^{(j)} - \mathbf{x}^{(i)}\|_{L_2} < \varepsilon$
  $K$-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$
• (2) Geodesic distance at large scale (shortest path), Floyd's algorithm:
  for each pair (i, j)
    for k = 1:N
      $d_{ij} = \min\{d_{ij},\, d_{ik} + d_{kj}\}$
    end
  end
  Run several rounds until convergence
70
Algorithm
• (3) MDS (a full code sketch follows this slide):
  Transfer pairwise distances into inner products: $S = -\frac{1}{2}HD_2H$, where $H = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^T$, i.e. $h(i,j) = \delta_{ij} - 1/N$ (for centering)
  EVD: $S = U\Lambda U^T$, $Y = I_{p\times N}\Lambda^{1/2}U^T$ ($p \le d \le N$)
  Proof: $-\frac{1}{2}HD_2H = -\frac{1}{2}\big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\big)D_2\big(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\big) = -\frac{1}{2}\big(D_2 - \frac{1}{N}\mathbf{1}\mathbf{1}^TD_2 - \frac{1}{N}D_2\mathbf{1}\mathbf{1}^T + \frac{1}{N^2}\mathbf{1}\mathbf{1}^TD_2\mathbf{1}\mathbf{1}^T\big) = S$
71
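Putting the three steps together, here is a compact ISOMAP sketch in numpy (K-NN graph, Floyd-Warshall shortest paths, then classical MDS). K, p, and the connectivity of the neighborhood graph are assumptions; this is an illustration, not the reference implementation of [4]:

```python
import numpy as np

def isomap(X, K, p):
    """X: d x N data matrix; returns Y (p x N). Assumes the K-NN graph is connected."""
    N = X.shape[1]
    sq = (X**2).sum(axis=0)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X.T @ X, 0.0))

    # (1) keep only K-NN edges (symmetrized); other entries are infinity
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(N):
        nbrs = np.argsort(D[i])[1:K + 1]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[nbrs, i]

    # (2) Floyd-Warshall shortest paths (geodesic distances)
    for k in range(N):
        G = np.minimum(G, G[:, [k]] + G[[k], :])

    # (3) classical MDS: double centering + EVD
    H = np.eye(N) - np.ones((N, N)) / N
    S = -0.5 * H @ (G**2) @ H
    eigvals, U = np.linalg.eigh(S)
    idx = np.argsort(eigvals)[::-1][:p]
    return np.sqrt(np.maximum(eigvals[idx], 0.0))[:, None] * U[:, idx].T
```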
Example
• Swiss roll:
72
[4]
Example
• Swiss roll: 350 points
MDS
ISOMAP
73
[1]
Example
74
[4]
Summary
• Compared to MDS, ISOMAP has the ability to discover the underlying structure (latent variables) that is nonlinearly embedded in the feature space
• It is a "global method", which preserves all pairwise distances
• The Euclidean-space assumption in the low-D space implies convexity, which sometimes fails
75
Locally Linear Embedding (LLE)
76
[5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, 2000
[6] L. K. Saul et al., Think globally, fit locally, 2003
LLE
• Neighborhood preserving:
Based on the fundamental manifold properties
Preserve the local geometry of each sample and its neighbors
Ignore the global geometry at large scale
• Assumption:
Well-sampled with sufficient data
Each sample and its neighbors lie on or close to a locally linear patch (sub-plane) of the manifold
77
LLE
• Properties:
Local geometry is characterized by the linear coefficients that reconstruct each sample from its neighbors
These coefficients are robust to rotation, scaling, and translation (RST)
• Re-embedding:
Assume the target space is locally smooth (a manifold)
Locally Euclidean, but not necessarily at large scale
Reconstruction coefficients are still meaningful
Stick local patches onto the low-D global coordinate
78
LLE
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$
79
Neighborhood properties
• Linear reconstruction coefficients:
80
Re-embedding
• Local patches into global coordinate:
81
Illustration
82
[5]
Algorithm
• Presetting:
  Define the weight matrix $W_{N\times N} = [w_{ij}]_{1\le i,j\le N}$, $w_{ij} \ge 0$, with rows $\mathbf{w}^{(1)T}, \mathbf{w}^{(2)T}, \ldots, \mathbf{w}^{(N)T}$
  Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$
• (1) Find the neighbors of each sample:
  $\varepsilon$-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\|\mathbf{x}^{(j)} - \mathbf{x}^{(i)}\|_{L_2} < \varepsilon$
  $K$-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$, with $K \ge p$
83
Algorithm
• (2) Linear reconstruction coefficients:
  Objective function: $E(W) = \sum_{i=1}^{N}\big\|\mathbf{x}^{(i)} - \sum_{j=1}^{N}w_{ij}\mathbf{x}^{(j)}\big\|_{L_2}^2 = \sum_{i=1}^{N}\|\mathbf{x}^{(i)} - X\mathbf{w}^{(i)}\|_{L_2}^2$
  Constraints (for RST invariance): for all $i$: $w_{ij} = 0$ if $\mathbf{x}^{(j)} \notin Nei(i)$, and $\sum_{j=1}^{N}w_{ij} = 1$
84
Algorithm
• (2) Linear reconstruction coefficients (for each sample $\mathbf{x}^{(i)}$):
  Define the neighbor indices $h_1, \ldots, h_m$ of $\mathbf{x}^{(i)}$, where $m = |Nei(i)|$
  $E^{(i)}(\mathbf{w}) = \big\|\mathbf{x}^{(i)} - \sum_{k=1}^{m}w_k\mathbf{x}^{(h_k)}\big\|_{L_2}^2 = \big\|\sum_{k=1}^{m}w_k(\mathbf{x}^{(i)} - \mathbf{x}^{(h_k)})\big\|_{L_2}^2$ (using $\mathbf{1}^T\mathbf{w} = 1$)
  $= \mathbf{w}^TC\,\mathbf{w}$, where $C_{kl} = (\mathbf{x}^{(i)} - \mathbf{x}^{(h_k)})^T(\mathbf{x}^{(i)} - \mathbf{x}^{(h_l)})$
85
Algorithm
• (2) Linear reconstruction coefficients:
  Algorithm (run for each sample $\mathbf{x}^{(i)}$; a code sketch follows this slide):
  Lagrangian: $E = \mathbf{w}^TC\mathbf{w} - \lambda(\mathbf{1}^T\mathbf{w} - 1)$, $\frac{\partial E}{\partial\mathbf{w}} = 2C\mathbf{w} - \lambda\mathbf{1} = 0 \;\Rightarrow\; \mathbf{w} = \frac{\lambda}{2}C^{-1}\mathbf{1}$
  With $\mathbf{1}^T\mathbf{w} = 1$: $\mathbf{w} = \dfrac{C^{-1}\mathbf{1}}{\mathbf{1}^TC^{-1}\mathbf{1}}$
  for $k = 1:|Nei(i)|$: set $w_{i,h_k} = w_k$ end
86
Algorithm
• (3) Re-embedding (minimize the reconstruction error again, now over $Y$):
  $\Phi(Y) = \sum_{i=1}^{N}\big\|\mathbf{y}^{(i)} - \sum_{j=1}^{N}w_{ij}\mathbf{y}^{(j)}\big\|_{L_2}^2$, with $Y = [\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(N)}]_{p\times N}$
  $= \mathrm{tr}\{(Y - YW^T)(Y - YW^T)^T\} = \mathrm{tr}\{Y(I_N - W^T)(I_N - W)Y^T\} = \mathrm{tr}\{Y(I_N - W - W^T + W^TW)Y^T\}$
87
Algorithm
• (3) Re-embedding:
  Definition: $M_{N\times N} = (I_N - W)^T(I_N - W) = I_N - W - W^T + W^TW$, $[m_{ij}] = [\delta_{ij} - w_{ij} - w_{ji} + \sum_k w_{ki}w_{kj}]_{1\le i,j\le N}$
  Constraints (avoid degenerate solutions): $\sum_{n=1}^{N}\mathbf{y}^{(n)} = \mathbf{0}$, $\frac{1}{N}\sum_{n=1}^{N}\mathbf{y}^{(n)}\mathbf{y}^{(n)T} = \frac{1}{N}YY^T = I$
  Optimization: $Y^* = \arg\min_Y \mathrm{tr}\{YMY^T\}$ subject to the constraints; apply the Rayleigh-Ritz theorem
88
Algorithm
• (3) Re-embedding (a code sketch follows this slide):
  Additional property (each row of $M$ sums to 0, since each row of $W$ sums to 1):
  $\sum_{j=1}^{N}m_{ij} = \sum_{j}\big[\delta_{ij} - w_{ij} - w_{ji} + \sum_k w_{ki}w_{kj}\big] = 1 - 1 - \sum_j w_{ji} + \sum_k w_{ki} = 0$
  so $\mathbf{1}_N$ is an eigenvector of $M$ with $\lambda = 0$
  Solution (EVD): $M = V\Lambda V^T$ with eigenvalues in ascending order; $Y^* = \arg\min_Y \mathrm{tr}\{YMY^T\} = (VI_{N\times p})^T$, taking the $p$ eigenvectors with the smallest nonzero eigenvalues (the constant eigenvector $\mathbf{1}_N$ is discarded)
89
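A sketch of this final embedding step in numpy, assuming the full weight matrix W has already been assembled from the per-sample weights of step (2):

```python
import numpy as np

def lle_embedding(W, p):
    """W: N x N weight matrix with rows summing to 1; returns Y (p x N)."""
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, V = np.linalg.eigh(M)           # ascending eigenvalues
    # eigvals[0] ~ 0 with the constant eigenvector; keep the next p
    return V[:, 1:p + 1].T * np.sqrt(N)      # scaling enforces (1/N) Y Y^T = I
```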
Example
• Swiss roll:
350 points
90
[5]
[1]
Example
• S shape:
91
[6]
Example
92
[5]
Summary
• Although the global geometry isn't explicitly preserved during LLE, it can still be reconstructed from the overlapping local neighborhoods.
• The matrix M on which we perform EVD is indeed sparse.
• K is a key factor in LLE, as it is in ISOMAP
93
Laplacian eigenmap
94
[7] M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, 2003
General setting
• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)} \in \mathbb{R}^d\}_{n=1}^N$
• Preprocessing: centering (mean can be added back): $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$
• Want to achieve: low-D: $Y = \{\mathbf{y}^{(n)} \in \mathbb{R}^p\}_{n=1}^N$
95
Laplacian eigenmap
• Fundamental:
  Laplace-Beltrami operator (for smoothness)
• Presetting:
  Define the weight matrix $W_{N\times N} = [w_{ij}]_{1\le i,j\le N}$, $w_{ij} \ge 0$
• Neighborhood definition:
  Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$
  $\varepsilon$-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\|\mathbf{x}^{(j)} - \mathbf{x}^{(i)}\|_{L_2} < \varepsilon$
  $K$-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$
96
Algorithm
• (1) Neighborhood definition
• (2) Weight computation:
  Heat kernel: $w_{ij} = \exp(-\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|_{L_2}^2 / t)$ if $\mathbf{x}^{(j)} \in Nei(i)$, and $w_{ij} = 0$ otherwise
• (3) Re-embedding:
  $E(Y) = \frac{1}{2}\sum_{i,j=1}^{N}\|\mathbf{y}^{(i)} - \mathbf{y}^{(j)}\|_{L_2}^2\,w_{ij} = \mathrm{tr}(Y(D - W)Y^T) = \mathrm{tr}(YLY^T)$, where $D$ is diagonal with $d_{ii} = \sum_j w_{ji}$ and $L = D - W$
97
Optimization
• Optimization (a code sketch follows this slide):
  $Y^* = \arg\min_Y \mathrm{tr}(YLY^T)$ subject to $YDY^T = I$
  Solution: generalized EVD $L\mathbf{v}^{(i)} = \lambda_i D\mathbf{v}^{(i)}$ with eigenvalues in ascending order; $Y^* = (VI_{N\times p})^T$, taking the $p$ generalized eigenvectors with the smallest nonzero eigenvalues ($\mathbf{1}_N$ is an eigenvector of $L$ with $\lambda = 0$ and is discarded)
98
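A compact Laplacian-eigenmap sketch using scipy's generalized symmetric eigensolver; K, t, and the data handling are placeholder assumptions, not the slides' original code:

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, K, t, p):
    """X: d x N data matrix; returns Y (p x N)."""
    N = X.shape[1]
    sq = (X**2).sum(axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X.T @ X, 0.0)

    # K-NN adjacency (symmetrized) with heat-kernel weights
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D2[i])[1:K + 1]
        W[i, nbrs] = np.exp(-D2[i, nbrs] / t)
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=1))              # degree matrix (assumed positive definite)
    L = D - W                               # graph Laplacian
    eigvals, V = eigh(L, D)                 # generalized EVD, ascending eigenvalues
    return V[:, 1:p + 1].T                  # drop the constant eigenvector
```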
Example
• Swiss roll: 2000 points
99
[7]
Example
• Example: From 3D to 3D
100
[1]
Thank you for listening
101