Similarities, Distances and Manifold Learning
Prof. Richard C. Wilson
Dept. of Computer Science, University of York
Background
• Typically objects are characterised by features
– Face images
– SIFT features
– Object spectra
– ...
• If we measure n features → n-dimensional space
• The arena for our problem is an n-dimensional vector space
Background
• Example: Eigenfaces
• Raw pixel values: n by m gives nm features
• Feature space is space of all n by m images
Background
• The space of all face-like images is smaller than the space of all images
• Assumption is faces lie on a smaller manifold embedded in the global space
All images
Face images
Manifold: A space which locally looks Euclidean
Manifold learning: Finding the manifold representing the objects we are interested in
All objects should be on the manifold, non-objects outside
Part I: Euclidean Space
Position, Similarity and Distance
Manifold Learning in Euclidean space
Some famous techniques
Part II: Non-Euclidean Manifolds
Assessing Data
Nature and Properties of Manifolds
Data Manifolds
Learning some special types of manifolds
Part III: Advanced Techniques
Methods for intrinsically curved manifolds
Thanks to Edwin Hancock, Eliza Xu, Bob Duin for contributions, and support from the EU SIMBAD project
Part I: Euclidean Space
Position
The main arena for pattern recognition and machine learning problems is vector space
– A set of n well defined features collected into a vector
ℝn
Also defined are addition of vectors and multiplication by a scalar
Feature vector → position
Similarity
To make meaningful progress, we need a notion of similarity
Inner product
• The inner-product ‹x,y› can be considered to be a similarity between x and y
‹x, y› = Σi xi yi
Induced norm
• The self-similarity ‹x,x› is (the square of) the ‘size’ of x and gives rise to the induced norm, or the length of x:
• Finally, the length of x allows the definition of a distance in our vector space as the length of the vector joining x and y
• Inner product also gets us distance
‖x‖ = √‹x, x›
d(x, y) = ‖x − y‖ = √‹x − y, x − y›
Euclidean space
• If we have a vector space for features, and the usual inner product, all three are connected:
d(x, y)  Distance
‹x, y›  Similarity
x, y  Position
Non-Euclidean Inner Product
• If the inner-product has the form
‹x, y› = xᵀy = Σi xi yi
• Then the vector space is Euclidean
• Note we recover all the expected stuff for Euclidean space, i.e.
‖x‖² = ‹x, x›
d(x, y)² = (x1 − y1)² + (x2 − y2)² + … + (xn − yn)²
• The inner-product doesn’t have to be like this; for example in Einstein’s special relativity, the inner-product of spacetime is
‹x, y› = x1y1 + x2y2 + x3y3 − x4y4
The Golden Trio
• In Euclidean space, the concepts of position, similarity and distance are elegantly connected
Position X
Similarity K
Distance D
Point position matrix
• In a normal manifold learning problem, we have a set of samples X={x1,x2,...,xm}
• These can be collected together in a matrix X
X = [x1ᵀ ; x2ᵀ ; … ; xmᵀ]   (each row of X is one sample)
I use this convention, but others may write them vertically
Centreing
A common and important operation is centreing – moving the mean to the origin
– Centred points behave better
• JX/m is the mean matrix, so X − JX/m is the centred matrix
– J is the all-ones matrix
• This can be done with C:  Xc = CX, where C = I − J/m
– C is the centreing matrix (and is symmetric, C = Cᵀ)
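A minimal numpy sketch of the centring operation (the function and variable names are mine, purely for illustration):

import numpy as np

def centre(X):
    """Centre the rows of the m x n point matrix X using the centreing matrix C = I - J/m."""
    m = X.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    return C @ X          # same result as X - X.mean(axis=0)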
Position-Similarity
• The similarity matrix K is defined as
Kij = ‹xi, xj›
• From the definition of X, we simply get
K = XXᵀ
• The Gram matrix is the similarity matrix of the centred points (from the definition of X)
Kc = (CX)(CX)ᵀ = CXXᵀCᵀ = CKC
– i.e. a centring operation on K
• Kc is really a kernel matrix for the points (linear kernel)
Position-Similarity
• To go from K to X, we need to consider the eigendecomposition of K
K = XXᵀ = UΛUᵀ
• As long as we can take the square root of Λ then we can find X as
X = UΛ^(1/2)
Kernel embedding
First manifold learning method – kernel embedding
Finds a Euclidean manifold from object similarities
• Embeds a kernel matrix into a set of points in Euclidean space (the points are automatically centred)
• K must have no negative eigenvalues, i.e. it is a kernel matrix (Mercer condition)
K = UΛUᵀ,  X = UΛ^(1/2)
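A sketch of kernel embedding in numpy, assuming K is already a centred positive semidefinite kernel matrix (the helper name kernel_embed is mine):

import numpy as np

def kernel_embed(K, dims=None):
    """Embed a PSD kernel matrix K as points X with K ~ X X^T."""
    evals, evecs = np.linalg.eigh(K)                 # K = U Lambda U^T
    order = np.argsort(evals)[::-1]                  # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    X = evecs * np.sqrt(np.clip(evals, 0.0, None))   # X = U Lambda^{1/2}
    return X if dims is None else X[:, :dims]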
Similarity-Distance
• Similarity K and distance D are related by
d²(xi, xj) = ‹xi − xj, xi − xj› = ‹xi, xi› + ‹xj, xj› − 2‹xi, xj›
• So the squared-distance matrix is
Ds,ij = Kii + Kjj − 2Kij
• We can easily determine Ds from K
Similarity-Distance
What about finding K from Ds ?
Looking at the equation Ds,ij = Kii + Kjj − 2Kij, we might imagine that
K = −½ Ds is a suitable choice
• Not centred; the relationship is actually
K = −½ C Ds C
Classic MDS
• Classic Multidimensional Scaling embeds a (squared) distance matrix into Euclidean space
• Using what we have so far, the algorithm is simple
• This is MDS
1. Compute the kernel:  K = −½ C Ds C
2. Eigendecompose the kernel:  K = UΛUᵀ
3. Embed the kernel:  X = UΛ^(1/2)
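Putting the three steps together, a minimal classical MDS sketch (it assumes Ds is the matrix of squared dissimilarities; names are illustrative):

import numpy as np

def classical_mds(Ds, dims=2):
    """Embed a squared-dissimilarity matrix Ds into 'dims' Euclidean dimensions."""
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ Ds @ C                              # 1. compute the kernel
    evals, evecs = np.linalg.eigh(K)                   # 2. eigendecompose the kernel
    order = np.argsort(evals)[::-1][:dims]
    evals, evecs = evals[order], evecs[:, order]
    return evecs * np.sqrt(np.clip(evals, 0.0, None))  # 3. embed: X = U Lambda^{1/2}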
The Golden Trio
Position X
Similarity K
Distance D
• Similarity → Position: kernel embedding
• Distance → Position: MDS
Ds,ij = Kii + Kjj − 2Kij
K = −½ C Ds C
Kernel methods
• A kernel is function k(i,j) which computes an inner-product
– But without needing to know the actual points (the space is implicit)
• Using a kernel function we can directly compute K without knowing X
k(i, j) = ‹xi, xj›   (kernel function)
Kernel methods
• The implied space may be very high dimensional, but a true kernel will always produce a positive semidefinite K and the implied space will be Euclidean
• Many (most?) PR algorithms can be kernelized
– Made to use K rather than X or D
• The trick is to note that any interesting vector should lie in the space spanned by the examples we are given
• Hence it can be written as a linear combination
• Look for α instead of u
u = α1x1 + α2x2 + … + αmxm = Xᵀα
Kernel PCA
• What about PCA? PCA solves the following problem
u* = argmax u uᵀΣu = argmax u (1/n) uᵀXᵀXu,  where Σ = (1/n) XᵀX
• Let’s kernelize, substituting u = Xᵀα:
(1/n) uᵀXᵀXu = (1/n) αᵀXXᵀXXᵀα = (1/n) αᵀK²α
uᵀu = αᵀXXᵀα = αᵀKα
Kernel PCA
• K2 has the same eigenvectors as K, so the eigenvectors of PCA are the same as the eigenvectors of K
• The eigenvalues of PCA are related to the eigenvalues of K by
λPCA = (1/n) λK²
• Kernel PCA is a kernel embedding with an externally provided kernel matrix
Kernel PCA
• So kernel PCA gives the same solution as kernel embedding– The eigenvalues are modified a bit
• They are essentially the same thing in Euclidean space
• MDS uses the kernel and kernel embedding
• MDS and PCA are essentially the same thing in Euclidean space
• Kernel embedding, MDS and PCA all give the same answer for a set of points in Euclidean space
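A small numerical check of the claim above that K² has the same eigenvectors as K, so kernel PCA and kernel embedding agree; the random data and all names here are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Xc = X - X.mean(axis=0)
K = Xc @ Xc.T                                   # centred linear kernel

evals_K, evecs_K = np.linalg.eigh(K)
evals_K2, evecs_K2 = np.linalg.eigh(K @ K)      # the kernelized PCA problem involves K^2

v = evecs_K2[:, -1]                             # leading eigenvector of K^2
lam = v @ K @ v                                 # its Rayleigh quotient with K
print(np.allclose(K @ v, lam * v))              # True: it is also an eigenvector of K
print(np.allclose(np.sort(evals_K2), np.sort(evals_K ** 2)))  # eigenvalues are squared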
Some useful observations
• Your similarity matrix is Euclidean iff it has no negative eigenvalues (i.e. it is a kernel matrix and PSD)
• By similar reasoning, your distance matrix is Euclidean iff the similarity matrix derived from it is PSD
• If the feature space is small but the number of samples is large, then the covariance matrix is small and it is better to do normal PCA (on the covariance matrix)
• If the feature space is large and the number of samples is small, then the kernel matrix will be small and it is better to do kernel embedding
Part II: Non-Euclidean Manifolds
Non-linear data
• Much of the data in computer vision lies in a high-dimensional feature space but is constrained in some way
– The space of all images of a face is a subspace of the space of all possible images
– The subspace is highly non-linear but low dimensional (described by a few parameters)
Non-linear data
• This cannot be exploited by the linear subspace methods like PCA
– These assume that the subspace is a Euclidean space as well
• A classic example is the ‘swiss roll’ data:
‘Flat’ Manifolds
• Fundamentally different types of data, for example:
• The embedding of this data into the high-dimensional space is highly curved
– This is called extrinsic curvature, the curvature of the manifold with respect to the embedding space
• Now imagine that this manifold was a piece of paper; you could unroll the paper into a flat plane without distorting it
– No intrinsic curvature, in fact it is homeomorphic to Euclidean space
• This manifold is different:
• It must be stretched to map it onto a plane– It has non-zero intrinsic curvature
• A flatlander living on this manifold can tell that it is curved, for example by measuring the ratio of the radius to the circumference of a circle
• In the first case, we might still hope to find a Euclidean embedding
• We can never find a distortion free Euclidean embedding of the second (in the sense that the distances will always have errors)
Curved manifold
Intrinsically Euclidean Manifolds
• We cannot use the previous methods on the second type of manifold, but there is still hope for the first
• The manifold is embedded in Euclidean space, but Euclidean distance is not the correct way to measure distance
• The Euclidean distance ‘shortcuts’ the manifold
• The geodesic distance calculates the shortest path along the manifold
Geodesics
• The geodesic generalizes the concept of distance to curved manifolds
– The shortest path joining two points which lies completely within the manifold
• If we can correctly compute the geodesic distances, and the manifold is intrinsically flat, we should get Euclidean distances which we can plug into our Euclidean geometry machine
ISOMAP
• ISOMAP is exactly such an algorithm
• Approximate geodesic distances are computed for the points from a graph
• Nearest neighbours graph
– For neighbours, Euclidean distance ≈ geodesic distance
– For non-neighbours, geodesic distance approximated by shortest distance in graph
• Once we have distances D, can use MDS to find Euclidean embedding
ISOMAP
• ISOMAP:
– Neighbourhood graph
– Shortest path algorithm
– MDS
• ISOMAP is distance-preserving – embedded distances should be close to geodesic distances
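A compact sketch of the three ISOMAP steps, assuming a symmetric k-nearest-neighbour graph is adequate and that the graph is connected; the function and parameter names are my own:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, k=8, dims=2):
    """Embed points X (m x n) by MDS on geodesic distances approximated over a k-NN graph."""
    D = cdist(X, X)                                      # Euclidean distances
    m = len(X)
    G = np.full((m, m), np.inf)                          # inf marks a non-edge
    idx = np.argsort(D, axis=1)[:, 1:k + 1]              # k nearest neighbours of each point
    for i in range(m):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.minimum(G, G.T)                               # symmetrise the neighbourhood graph
    geo = shortest_path(G, method='D', directed=False)   # approximate geodesic distances
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ (geo ** 2) @ C                        # MDS on squared geodesic distances
    evals, evecs = np.linalg.eigh(K)
    evals, evecs = evals[::-1][:dims], evecs[:, ::-1][:, :dims]
    return evecs * np.sqrt(np.clip(evals, 0.0, None))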
Laplacian Eigenmap
• The Laplacian Eigenmap is another graph-based method of embedding non-linear manifolds into Euclidean space
• As with ISOMAP, form a neighbourhood graph for the datapoints
• Find the graph Laplacian as follows
• The adjacency matrix A is
Aij = exp(−dij²/t)  if i and j are connected, 0 otherwise
• The ‘degree’ matrix D is the diagonal matrix
Dii = Σj Aij
• The normalized graph Laplacian is
L = I − D^(−1/2) A D^(−1/2)
Laplacian Eigenmap
• We find the Laplacian eigenmap embedding using the eigendecomposition of L
L = UΛUᵀ
• The embedded positions are
X = D^(−1/2) U
• Similar to ISOMAP
– Structure preserving, not distance preserving
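A sketch of the Laplacian eigenmap recipe above, with heat-kernel edge weights; the parameter names and the choice of weighting are mine, and a connected graph is assumed:

import numpy as np
from scipy.spatial.distance import cdist

def laplacian_eigenmap(X, k=8, t=1.0, dims=2):
    """Embed points X using the normalised graph Laplacian of a k-NN graph."""
    D = cdist(X, X)
    m = len(X)
    A = np.zeros((m, m))
    idx = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(m):
        A[i, idx[i]] = np.exp(-D[i, idx[i]] ** 2 / t)   # heat-kernel weights on neighbours
    A = np.maximum(A, A.T)                              # symmetrise
    d = A.sum(axis=1)                                   # every point has >= k neighbours, so d > 0
    D_inv_half = np.diag(d ** -0.5)
    L = np.eye(m) - D_inv_half @ A @ D_inv_half         # L = I - D^{-1/2} A D^{-1/2}
    evals, evecs = np.linalg.eigh(L)
    U = evecs[:, 1:dims + 1]                            # skip the trivial constant eigenvector
    return D_inv_half @ U                               # X = D^{-1/2} U, as above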
Locally-Linear Embedding
• Locally-linear Embedding is another classic method which also begins with a neighbourhood graph
• We make point i (in the original data) from a weighted sum of the neighbouring points
• Wij is 0 for any point j not in the neighbourhood (and for i = j)
• We find the weights by minimising the reconstruction error
x̂i = Σj Wij xj,   min Σi |xi − x̂i|²
– Subject to the constraints that the weights are non-negative and sum to 1: Wij ≥ 0, Σj Wij = 1
• Gives a relatively simple closed-form solution
Locally-Linear Embedding
• These weights encode how well a point j represents a point i and can be interpreted as the adjacency between i and j
• A low dimensional embedding is found by then finding points to minimise the error
min Σi |yi − ŷi|²,   ŷi = Σj Wij yj
• In other words, we find a low-dimensional embedding which preserves the adjacency relationships
• The solution to this embedding problem turns out to be simply the eigenvectors of the matrix M
M = (I − W)ᵀ(I − W)
• LLE is scale-free: the final points have the covariance matrix I
– Unit scale
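A sketch of LLE as described above; note that the standard closed-form weight solution below enforces only the sum-to-one constraint (a common simplification), and the regularisation term and names are my own choices:

import numpy as np
from scipy.spatial.distance import cdist

def lle(X, k=8, dims=2, reg=1e-3):
    """Locally-linear embedding: reconstruction weights, then bottom eigenvectors of M."""
    m = len(X)
    D = cdist(X, X)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]
    W = np.zeros((m, m))
    for i in range(m):
        Z = X[idx[i]] - X[i]                      # neighbours relative to x_i
        G = Z @ Z.T                               # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)        # regularise for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx[i]] = w / w.sum()                # weights sum to one
    M = (np.eye(m) - W).T @ (np.eye(m) - W)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, 1:dims + 1]                   # skip the constant eigenvector; unit scale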
Comparison
• LLE might seem like quite a different process to the previous two, but actually very similar
• We can interpret the process as producing a kernel matrix followed by scale-free kernel embedding
Representation:
– ISOMAP: neighbourhood graph
– Lap. Eigenmap: neighbourhood graph
– LLE: neighbourhood graph
Similarity matrix:
– ISOMAP: from geodesic distances
– Lap. Eigenmap: graph Laplacian
– LLE: reconstruction weights
Embedding:
– ISOMAP: K = UΛUᵀ, X = UΛ^(1/2)
– Lap. Eigenmap: L = UΛUᵀ, X = D^(−1/2)U
– LLE: M = (I − W)ᵀ(I − W) = UΛUᵀ, X = U
Comparison
• ISOMAP is the only method which directly computes and uses the geodesic distances
– The other two depend indirectly on the distances through local structure
• LLE is scale-free, so the original distance scale is lost, but the local structure is preserved
• Computing the necessary local dimensionality to find the correct nearest neighbours is a problem for all such methods
Non-Euclidean data
• Data is Euclidean iff K is psd
• Unless you are using a kernel function, this is often not true
• Why does this happen?
What type of data do I have?
• Starting point: distance matrix
• However we do not know a priori if our measurements are representable on a manifold
– We will call them dissimilarities
• Our starting point to answer the question “What type of data do I have?” will be a matrix of dissimilarities D between objects
• Types of dissimilarities
– Euclidean (no intrinsic curvature)
– Non-Euclidean, metric (curved manifold)
– Non-metric (no point-like manifold representation)
Causes
• Example: Chicken pieces data
• Distance by alignment
• Global alignment of everything could find Euclidean distances
• Only local alignments are practical
Causes
Dissimilarities may also be non-metric
The data is metric if it obeys the metric conditions:
1. Dij ≥ 0 (non-negativity)
2. Dij = 0 iff i = j (identity of indiscernibles)
3. Dij = Dji (symmetry)
4. Dij ≤ Dik + Dkj (triangle inequality)
Reasonable dissimilarities should meet 1 & 2
Causes
• Symmetry Dij= Dji
• May not be symmetric by definition
• Alignment: i→j may find a better solution than j→i
Causes
• Triangle violations Dij≤Dik+ Dkj
• ‘Extended objects’
• Finally, noise in the measure of D can cause all of these effects
[Figure: extended objects i, j, k with Dik ≈ 0 and Dkj ≈ 0 but Dij > 0, violating the triangle inequality]
Tests(1)
• Find the similarity matrix
K = −½ C Ds C
• The data is Euclidean iff K is positive semidefinite (no negative eigenvalues)
– K is a kernel, explicit embedding from kernel embedding
• We can then use K in a kernel algorithm
Tests(2)
• Negative eigenfraction (NEF)
• Between 0 and 0.5
NEF = Σ(λi < 0) |λi| / Σi |λi|
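A one-function sketch of the NEF test (names are mine):

import numpy as np

def negative_eigenfraction(Ds):
    """NEF: the fraction of the derived kernel's eigenvalue mass that is negative."""
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ Ds @ C
    evals = np.linalg.eigvalsh(K)
    return np.abs(evals[evals < 0]).sum() / np.abs(evals).sum()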
Tests(3)
1. Dij ≥ 0 (non-negativity)
2. Dij = 0 iff i = j (identity of indiscernibles)
3. Dij = Dji (symmetry)
4. Dij ≤ Dik + Dkj (triangle inequality)
– Check these for your data (the 4th involves checking all triples)
– Metric data is embeddable on a (curved) Riemannian manifold
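A sketch that checks the four metric conditions in bulk; the triangle test uses O(m³) memory, which is fine for small datasets (names are mine):

import numpy as np

def check_metric(D, tol=1e-9):
    """Check the four metric conditions for an m x m dissimilarity matrix D."""
    off = ~np.eye(len(D), dtype=bool)                    # off-diagonal mask
    nonneg   = bool((D >= -tol).all())                   # 1. D_ij >= 0
    identity = bool(np.allclose(np.diag(D), 0.0) and (D[off] > tol).all())  # 2.
    symmetry = bool(np.allclose(D, D.T))                 # 3. D_ij = D_ji
    # 4. D_ij <= D_ik + D_kj for every triple (i, j, k)
    triangle = bool((D[:, None, :] + D.T[None, :, :] >= D[:, :, None] - tol).all())
    return {'non-negativity': nonneg, 'identity of indiscernibles': identity,
            'symmetry': symmetry, 'triangle inequality': triangle}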
Corrections
• If the data is non-metric or non-Euclidean, we can ‘correct it’
• Symmetry violations
– Average: Dij ← ½(Dij + Dji)
– For min-cost distances the minimum may be more appropriate: Dij ← min(Dij, Dji)
• Triangle violations
– Constant offset: Dij ← Dij + c (for i ≠ j)
– This will also remove non-Euclidean behaviour for large enough c
• Euclidean violations
– Discard negative eigenvalues
• There are many other approaches*
* “On Euclidean corrections for non-Euclidean dissimilarities”, Duin, Pekalska, Harol, Lee and Bunke, S+SSPR 08
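A sketch of two of the corrections above: symmetrising plus a constant offset just large enough to repair the triangle inequality, and discarding negative eigenvalues. The choice of offset and all names are mine, not necessarily the procedures used in the reference:

import numpy as np

def correct_dissimilarities(D):
    """Symmetrise, then add the smallest constant offset that removes triangle violations."""
    D = 0.5 * (D + D.T)                                       # average out symmetry violations
    # worst violation of D_ij <= D_ik + D_kj gives the offset c we need (O(m^3) memory)
    viol = D[:, :, None] - D[:, None, :] - D.T[None, :, :]
    c = max(0.0, float(viol.max()))
    off = ~np.eye(len(D), dtype=bool)
    D[off] += c                                               # constant offset, i != j only
    return D

def drop_negative_eigenvalues(Ds):
    """'Euclidean correction': embed using only the positive eigenvalues of K = -1/2 C Ds C."""
    m = len(Ds)
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ Ds @ C
    evals, evecs = np.linalg.eigh(K)
    keep = evals > 0
    return evecs[:, keep] * np.sqrt(evals[keep])              # point positions X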
Part III: Advanced techniques for non-Euclidean Embeddings
Known Manifolds
• Sometimes we have data which lies on a known but non-Euclidean manifold
• Examples in Computer Vision
– Surface normals
– Rotation matrices
– Flow tensors (DT-MRI)
• This is not Manifold Learning, as we already know what the manifold is
• What tools do we need to be able to process data like this?
– As before, distances are the key
Example: 2D direction
Direction of an edge in an image, encoded as a unit vector
The average of the direction vector isn’t even a direction vector (not unit length), let alone the correct ‘average’ direction
The normal definition of mean is not correct
– Because the manifold is curved
[Figure: unit vectors x1, x2 and their Euclidean mean x̄, which falls inside the circle]
x̄ = (1/n) Σi xi
Tangent space
• The tangent space (TP) is the Euclidean space which is parallel to the manifold (M) at a particular point (P)
• The tangent space is a very useful tool because it is Euclidean
[Figure: manifold M with tangent space TP at the point P]
Exponential Map
• Exponential map:  ExpP : TP → M,  A = ExpP(X)
• ExpP maps a point X on the tangent plane onto a point A on the manifold
– P is the centre of the mapping and is at the origin on the tangent space
– The mapping is one-to-one in a local region of P
– The most important property of the mapping is that the distances to the centre P are preserved
– The geodesic distance on the manifold equals the Euclidean distance on the tangent plane (for distances to the centre only)
dTP(X, P) = dM(A, P)
Exponential map
• The log map goes the other way, from manifold to tangent plane
LogP : M → TP,  X = LogP(A)
Exponential Map
• Example on the circle: Embed the circle in the complex plane
• The manifold representing the circle is a complex number with magnitude 1 and can be written x + iy = exp(iθ)
• In this case it turns out that the map is related to the normal exp and log functions
P = e^(iθP),  A = e^(iθA)
X = LogP(A) = i e^(iθP) (θA − θP)
A = ExpP(X) = e^(iθP) · exp(e^(−iθP) X)
Intrinsic mean
• The mean of a set of samples is usually defined as the sum of the samples divided by the number– This is only true in Euclidean space
• A more general formula
• Minimises the distances from the mean to the samples (equivalent in Euclidean space)
x̄ = argmin x Σi dg(x, xi)²
Intrinsic mean
• We can compute this intrinsic mean using the exponential map
• If we knew what the mean was, then we can use the mean as the centre of a map
• From the properties of the Exp-map, the distances are the same
• So the mean on the tangent plane is equal to the mean on the manifold
Xi = LogM(Ai),   de(Xi, M) = dg(Ai, M)
Intrinsic mean
• Start with a guess at the mean and move towards correct answer
• This gives us the following algorithm
– Guess at a mean M0
1. Map onto the tangent plane using Mk
2. Compute the mean on the tangent plane to get new estimate Mk+1
Mk+1 = ExpMk( (1/n) Σi LogMk(Ai) )
Intrinsic Mean
• For many manifolds, this procedure will converge to the intrinsic mean
– Convergence not always guaranteed
• Other statistics and probability distributions on manifolds are problematic
– Can hypothesise a normal distribution on the tangent plane, but distortions are inevitable
Some useful manifolds and maps
• Some useful manifolds and exponential maps
• Directional vectors (surface normals etc.)
• a, p unit vectors, x lies in an (n-1)D space
‹a, a› = 1,  ‹p, p› = 1,  θ = cos⁻¹‹a, p›
x = Logp(a) = θ (a − p cos θ) / sin θ   (Log map)
a = Expp(x) = p cos‖x‖ + (x/‖x‖) sin‖x‖   (Exp map)
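A sketch of these maps for unit vectors, together with the intrinsic mean iteration from the previous slides; the function names, initial guess and iteration count are my own choices:

import numpy as np

def log_map(p, a):
    """Log map for unit vectors: tangent vector at p towards a, length = geodesic distance."""
    theta = np.arccos(np.clip(np.dot(p, a), -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros_like(p)
    return theta * (a - p * np.cos(theta)) / np.sin(theta)

def exp_map(p, x):
    """Exp map for unit vectors: walk from p along the tangent vector x."""
    theta = np.linalg.norm(x)
    if theta < 1e-12:
        return p.copy()
    return p * np.cos(theta) + (x / theta) * np.sin(theta)

def intrinsic_mean(points, iters=20):
    """Iterate: map to the tangent plane at the current estimate, average, map back."""
    m = points[0] / np.linalg.norm(points[0])           # initial guess at the mean
    for _ in range(iters):
        tangent_mean = np.mean([log_map(m, a) for a in points], axis=0)
        m = exp_map(m, tangent_mean)
    return m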
Some useful manifolds and maps
• Symmetric positive definite matrices (covariance, flow tensors etc)
• A is symmetric positive definite, X is just symmetric
• log is the matrix log defined as a generalized matrix function
uᵀAu > 0 for all u ≠ 0
X = LogP(A) = P^(1/2) log( P^(−1/2) A P^(−1/2) ) P^(1/2)   (Log map)
A = ExpP(X) = P^(1/2) exp( P^(−1/2) X P^(−1/2) ) P^(1/2)   (Exp map)
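A sketch of these Log and Exp maps using scipy's matrix functions (names are mine; the small imaginary round-off that sqrtm/logm can produce is discarded):

import numpy as np
from scipy.linalg import expm, logm, sqrtm, inv

def spd_log_map(P, A):
    """Log map for SPD matrices: symmetric tangent matrix X at base point P."""
    Ph = np.real(sqrtm(P))                 # P^{1/2}
    Pih = inv(Ph)                          # P^{-1/2}
    return Ph @ np.real(logm(Pih @ A @ Pih)) @ Ph

def spd_exp_map(P, X):
    """Exp map for SPD matrices: maps a symmetric tangent matrix X back to the manifold."""
    Ph = np.real(sqrtm(P))
    Pih = inv(Ph)
    return Ph @ expm(Pih @ X @ Pih) @ Ph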
Some useful manifolds and maps
• Orthogonal matrices (rotation matrices, eigenvector matrices)
• A orthogonal, X antisymmetric (X+XT=0)
• These are the matrix exp and log functions as before
• In fact there are multiple solutions to the matrix log
– Only one is the required real antisymmetric matrix; not easy to find
– Rest are complex
AᵀA = I
X = LogP(A) = log(PᵀA)   (Log map)
A = ExpP(X) = P exp(X)   (Exp map)
Embedding on Sn
• On S2 (surface of a sphere in 3D) the following parameterisation is well known
• The distance between two points (the length of the geodesic) is
x = (r sin θx cos φx, r sin θx sin φx, r cos θx)ᵀ
dxy = r cos⁻¹( cos θx cos θy + sin θx sin θy cos(φx − φy) )
More Spherical Geometry
• But on a sphere, the distance is the highlighted arc-length
– Much neater to use the inner-product
– And works in any number of dimensions
‹x, y› = r² cos θxy
dxy = r θxy = r cos⁻¹( ‹x, y› / r² )
Spherical Embedding
• Say we had the distances between some objects (dij), measured on the surface of a [hyper]sphere of dimension n
• The sphere (and objects) can be embedded into an n+1 dimensional space
– Let X be the matrix of point positions
• Z = XXᵀ is a kernel matrix, with Zij = ‹xi, xj›
• But ‹xi, xj› = r² cos θij, and dij = r θij
• And so Zij = r² cos(dij / r)
• We can compute Z from D and find the spherical embedding!
Spherical Embedding
• But wait, we don’t know what r is!
• The distances D are non-Euclidean, and if we use the wrong radius, Z is not a kernel matrix
– Negative eigenvalues
• Use this to find the radius
– Choose r to minimise the negative eigenvalues
r* = argmin r Σ(λi < 0) |λi(Z(r))|
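One possible sketch of the whole procedure: search for the radius that minimises the negative eigenvalue mass of Z(r), then embed. The search bounds, the scalar optimiser and the final renormalisation onto the sphere are my own choices, not necessarily the original method:

import numpy as np
from scipy.optimize import minimize_scalar

def negative_mass(r, D):
    """Sum of |negative eigenvalues| of Z(r) = r^2 cos(D / r)."""
    Z = r ** 2 * np.cos(D / r)
    evals = np.linalg.eigvalsh(Z)
    return np.abs(evals[evals < 0]).sum()

def spherical_embedding(D, dims=3):
    """Embed a geodesic distance matrix D onto a sphere of estimated radius r."""
    res = minimize_scalar(negative_mass, args=(D,),
                          bounds=(D.max() / np.pi, 10 * D.max()), method='bounded')
    r = res.x
    Z = r ** 2 * np.cos(D / r)
    evals, evecs = np.linalg.eigh(Z)
    evals, evecs = evals[::-1][:dims], evecs[:, ::-1][:, :dims]
    X = evecs * np.sqrt(np.clip(evals, 0.0, None))
    X = r * X / np.linalg.norm(X, axis=1, keepdims=True)   # project points back onto the sphere
    return X, r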
Example: Texture Mapping
• As an alternative to unwrapping the object onto a plane and texture-mapping the plane
• Embed onto a sphere and texture-map the sphere
[Figure: texture mapped via a plane vs. via a sphere]
Backup slides
Laplacian and related processes
• As well as embedding objects onto manifolds, we can model many interesting processes on manifolds
• Example: the way ‘heat’ flows across a manifold can be very informative
du/dt = ∇²u   (heat equation)
• ∇² is the Laplacian; in 3D Euclidean space it is
∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z²
• On a sphere it is
∇² = (1/(r² sin θ)) ∂/∂θ( sin θ ∂/∂θ ) + (1/(r² sin²θ)) ∂²/∂φ²
Heat flow
• Heat flow allows us to do interesting things on a manifold
• Smoothing: Heat flow is a diffusion process (will smooth the data)
• Characterising the manifold (heat content, heat kernel coefficients...)
• The Laplacian depends on the geometry of the manifold
– We may not know this
– It may be hard to calculate explicitly
• Graph Laplacian
Graph Laplacian
• Given a set of datapoints on the manifold, describe them by a graph
– Vertices are datapoints, edges are adjacency relation
• Adjacency matrix (for example)
• Then the graph Laplacian is
• The graph Laplacian is a discrete approximation of the manifold Laplacian
Aij = exp(−dij² / σ²)
Vii = Σj Aij
L = V − A
Heat Kernel
• Using the graph Laplacian, we can easily implement heat-flow methods on the manifold using the heat-kernel
• Can diffuse a function on the manifold by
du/dt = −Lu   (heat equation)
H = exp(−Lt)   (heat kernel)
f′ = Hf
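A sketch of heat-kernel smoothing on a small graph; the example graph, time step and names are mine:

import numpy as np
from scipy.linalg import expm

def heat_diffuse(A, f, t=1.0):
    """Diffuse a function f over a graph with adjacency matrix A for time t."""
    V = np.diag(A.sum(axis=1))           # degree matrix
    L = V - A                            # graph Laplacian
    H = expm(-t * L)                     # heat kernel H = exp(-Lt)
    return H @ f                         # f' = H f

# A three-node path graph: heat placed on node 0 spreads towards the other nodes
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
f = np.array([1.0, 0.0, 0.0])
print(heat_diffuse(A, f, t=0.5))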