Dimensionality Reduction & PCA
Ken Kreutz-Delgado
Nuno Vasconcelos
ECE 175B – Spring 2011 – UCSD
2
The Curse of Dimensionality
The curse of dimensionality means that
• for a given classifier, the number of examples required to maintain classification accuracy increases exponentially with the dimension of the feature space
In higher dimensions the classifier has more parameters
• and therefore has higher complexity, which makes it harder to learn
3
Dimensionality Reduction
How can we avoid the curse of dimensionality?
Avoid unnecessary dimensions in feature space!
“Unnecessary” can be measured in two ways:
1. Features are non-discriminative
2. Features are not independent
“Non-discriminative” means that they do not separate classes well:
[Figure: two scatter plots, one where the feature separates the classes (discriminative) and one where it does not (non-discriminative)]
4
Dimensionality Reduction
In addition, dependent features are not always all needed - fewer can be enough!
E.g. consider data-mining company studying consumer credit card ratings with a feature vector:
X = {salary, mortgage, car loan, # of kids, profession, ...}
Note that the first three features tend to be highly correlated:
• “The more you make, the higher the mortgage, the more expensive the car you drive”
• From one of these variables one can often predict the others very well
Including features 2 and 3 might not greatly increase the discrimination, but it will definitely increase the dimension, which can lead to poor density estimates.
5
Q: how do we detect the presence of feature correlations?
A: Detect if the data “mostly lives” in a lower dimensional subspace (up to some amount of noise).
In this example we have a 1-D hyperplane in 2-D
• If we can find this hyperplane, we can project the data onto it and thereby remove a dimension of the feature space without introducing significant error
Dimensionality Reduction
[Figure: scatter of (salary, car loan) data lying close to a line; projection onto the 1-D subspace y = a x yields a new 1-D feature y]
6
Principal Component Analysis (PCA)
Basic idea:
• If the data “lives” in a lower dimensional subspace, it is going to look flat when viewed from the full space, e.g.
This means that
• If we fit a Gaussian to the data the iso-probability contours (aka concentration ellipsoids) are going to be “pancake-like” ellipsoids. This is because the location of the “probability mass” tells you where the data “lives.”
[Figures: data on a 1-D subspace in 2-D; data on a 2-D subspace in 3-D]
7
Principal Component Analysis (PCA)
How do we find these ellipsoids?
Recall that when we talked about metrics we said that
• the Mahalanobis distance
    d(x, y) = (x - y)^T Σ^{-1} (x - y)
  determines the "natural" units for a Gaussian pdf, because it is "adapted" to the covariance of the data
We also know that
• what is special about it is that it uses Σ^{-1}
Hence, the subspace information must be in the covariance Σ
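As a small illustration (not from the slides), a NumPy sketch of the Mahalanobis distance for a given covariance Σ; the matrix and points below are made-up examples:

import numpy as np

def mahalanobis(x, y, Sigma):
    # d(x, y) = (x - y)^T Sigma^{-1} (x - y)
    diff = x - y
    return float(diff @ np.linalg.solve(Sigma, diff))

# toy 2-D covariance with strongly correlated features
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
x = np.array([2.0, 1.0])
y = np.array([0.0, 0.0])
print(mahalanobis(x, y, Sigma))   # distance in the "natural" units of the Gaussian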
8
Principal Component Analysis (PCA)
All relevant information is stored in the eigenvalue decomposition of the covariance matrix
So, let’s start with a brief review of eigenvectors
• Recall: an n x n matrix represents an operator that maps a vector from R^n into a vector in R^n
E.g. the equation y = Ax represents the sending of x in R^n to y in R^n:
    [ y_1 ]   [ a_11 ... a_1n ] [ x_1 ]
    [ ... ] = [ ...       ... ] [ ... ]
    [ y_n ]   [ a_n1 ... a_nn ] [ x_n ]
[Figure: A maps x, written in the canonical basis e_1, ..., e_n, to y in the same basis]
9
Eigenvectors
What is amazing is that there are "special" ("eigen" or "proper") vectors which are simply re-scaled by the transformation
These are the eigenvectors of the matrix A
• They are the solutions φ_i to the equation
    A φ_i = λ_i φ_i
  where the λ_i are the matrix eigenvalues
[Figure: under y = Ax, a generic vector is rotated and scaled, while an eigenvector is only re-scaled]
10
Eigenvector Decomposition
Note that if there are n eigenvectors, these equations can be written all at once as
    A [ φ_1 ... φ_n ] = [ λ_1 φ_1 ... λ_n φ_n ]
or,
    A [ φ_1 ... φ_n ] = [ φ_1 ... φ_n ] diag(λ_1, ..., λ_n)
i.e.
    A Φ = Φ Λ,   with Φ = [ φ_1 ... φ_n ] and Λ = diag(λ_1, ..., λ_n)
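As an illustrative check (my own sketch, with a made-up symmetric matrix), the relation A Φ = Φ Λ can be verified numerically:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # a symmetric example matrix

lam, Phi = np.linalg.eigh(A)         # eigenvalues lam, eigenvectors as columns of Phi
Lambda = np.diag(lam)

print(np.allclose(A @ Phi, Phi @ Lambda))   # True: A Phi = Phi Lambda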
11
Symmetric Eigendecomposition
A covariance matrix is symmetric, so there are n orthogonal eigenvectors (which can be taken to have unit norm).
This implies that Φ = [ φ_1 ... φ_n ] is an orthogonal matrix,
    Φ^T Φ = Φ Φ^T = I,
so that
    A = Φ Λ Φ^T
This is called the eigenvector decomposition (or eigen-decomposition) of the symmetric matrix A
It gives an alternative geometric interpretation to the matrix operation y = Ax via
    y = A x = Φ Λ Φ^T x
12
Symmetric Eigendecomposition
This can be interpreted as a sequence of three steps
• 1) Apply Φ^T to x
  - We know that this is (nearly) a rotation. It is just the inverse rotation: x' = Φ^T x
• 2) Apply Λ
  - This is just coordinate-wise scaling: x'' = Λ x', i.e. x''_i = λ_i x'_i
• 3) Apply Φ
  - This is (nearly) a rotation back to the initial space: y = Φ x''
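A small numerical sketch (my own illustration, reusing the made-up symmetric A above) of the three-step view y = Φ Λ Φ^T x:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, Phi = np.linalg.eigh(A)

x = np.array([1.0, -2.0])
x_prime  = Phi.T @ x           # 1) inverse rotation into the eigenvector frame
x_dprime = lam * x_prime       # 2) coordinate-wise scaling by the eigenvalues
y        = Phi @ x_dprime      # 3) rotation back to the original frame

print(np.allclose(y, A @ x))   # True: same result as applying A directly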
13
Rotation Matrices
Remember that rotation matrices are best understood by considering how the matrix operates on the vectors of the canonical basis
• Note that Φ sends e_1 to φ_1:
    Φ e_1 = [ φ_1 ... φ_n ] [1, 0, ..., 0]^T = φ_1
• Since Φ^T is the inverse rotation, it sends φ_1 to e_1
Hence, the sequence of operations is
• 1) Send the φ_i to the e_i (the canonical basis in the φ_i-frame)
• 2) Rescale the e_i
• 3) Rotate the re-scaled e_i back to the original coordinate frame
[Figure: a 2-D rotation sends e_1 to the vector (cos θ, sin θ)]
14
Symmetric Eigendecomposition
Graphically:
[Figure: the unit circle in the (e_1, e_2) frame is (1) rotated by Φ^T, (2) stretched along the coordinate axes, and (3) rotated back by Φ, giving an ellipse]
This means that:
1) The φ_i are the axes of the ellipse
2) The width of the ellipse depends on the amount of "stretching"
15
Symmetric Eigendecomposition
Note that the stretching is done in step (2): for x' = e_i, the length of x'' = Λ x' is λ_i
Hence, the overall picture is
• The axes of the ellipse are given by the φ_i
• These have length λ_i
This insight can be used to find low-dimensional data subspaces
[Figure: the resulting ellipse has its axes along the φ_i; the axis along φ_1 has length λ_1]
16
Multivariate Gaussian Review
The equiprobability density contours of a d-dimensional Gaussian are the points x such that
    (x - μ)^T Σ^{-1} (x - μ) = c   (for some constant c)
Let's consider the change of variable z = x - μ, which only shifts the origin to the point μ. The equation
    z^T Σ^{-1} z = c
is the equation of an ellipse (hyper-ellipse).
This is easy to see when Σ is diagonal, Σ = diag(σ_1^2, ..., σ_d^2):
    z_1^2/σ_1^2 + ... + z_d^2/σ_d^2 = c
17
Gaussian Review
This is the equation of an ellipse with principal lengths σ_i
• E.g. when d = 2 (and c = 1), the equation
    z_1^2/σ_1^2 + z_2^2/σ_2^2 = 1
  describes the ellipse
Guided by this insight, introduce the transformation y = Φ z
[Figure: an axis-aligned ellipse in the (z_1, z_2) plane with semi-axes σ_1 and σ_2]
18
Gaussian Review
Introduce the transformation y = Φ z, assuming that z has diagonal covariance Σ_z = diag(σ_1^2, ..., σ_d^2)
Then y has covariance
    Σ_y = Φ Σ_z Φ^T
Since Φ is orthogonal, it is (nearly) a rotation, so we obtain a rotated ellipse with principal components φ_1 and φ_2, which are the columns of Φ
Note that Σ_y = Φ Σ_z Φ^T is the eigendecomposition of Σ_y
[Figure: the axis-aligned ellipse with semi-axes σ_1, σ_2 in the (z_1, z_2) plane is mapped by y = Φ z to a rotated ellipse in the (y_1, y_2) plane with axes along φ_1 and φ_2]
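A quick numerical sketch (my own illustration, with an assumed rotation angle and made-up variances) of how the map y = Φ z shapes the covariance:

import numpy as np

rng = np.random.default_rng(0)

# z: zero-mean samples with diagonal covariance diag(3^2, 0.5^2)
sigma = np.array([3.0, 0.5])
z = rng.normal(size=(100000, 2)) * sigma

# an orthogonal (rotation) matrix Phi
theta = np.pi / 6
Phi = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

y = z @ Phi.T                       # y = Phi z, applied to each sample (row)
Sigma_y = np.cov(y, rowvar=False)   # sample covariance of y

print(np.round(Sigma_y, 2))                          # estimated covariance
print(np.round(Phi @ np.diag(sigma**2) @ Phi.T, 2))  # Phi Sigma_z Phi^T: should be close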
19
Principal Component Analysis (PCA)
If y is Gaussian with covariance Σ, the equiprobability density contours are the ellipses whose
• "principal components" φ_i are the eigenvectors of Σ
• "principal values" σ_i are the square roots of the eigenvalues λ_i of Σ
By computing the eigenvalues we can find out if the data is "flat":
    λ_1 >> λ_2 : flat        λ_1 ≈ λ_2 : not flat
[Figures: a strongly elongated ellipse (λ_1 >> λ_2, flat data) and a nearly circular ellipse (λ_1 ≈ λ_2, not flat)]
20
Learning the Principal Components
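A minimal NumPy sketch (my own illustration, not the lecture's code) of learning principal components by eigendecomposition of the sample covariance, the procedure the later PCA-by-SVD slides treat as the baseline:

import numpy as np

def learn_principal_components(X, k):
    # X: d x n data matrix, one example per column; keep the top-k components
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                # center the data
    Sigma = Xc @ Xc.T / X.shape[1]             # sample covariance (1/n convention)
    lam, Phi = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]              # re-order: largest eigenvalues first
    return Phi[:, order[:k]], lam[order[:k]], mu

# project the data onto the top-k principal components: y = Phi_k^T (x - mu)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
Phi_k, lam_k, mu = learn_principal_components(X, k=2)
Y = Phi_k.T @ (X - mu)
print(Y.shape)   # (2, 200)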
21
Principal Component Analysis (PCA)
22
Principal Component Analysis (PCA)
How to determine the number k of eigenvectors to keep?
• One possibility is to plot eigenvalue magnitudes
- This yields a “scree plot” or “L-curve”
• Usually there is a fast decrease in the eigenvalue magnitude followed by a flat area
• One good choice is the “knee” of this curve
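A minimal sketch (my own illustration, with made-up eigenvalue magnitudes) of such a scree plot:

import numpy as np
import matplotlib.pyplot as plt

# eigenvalues of the sample covariance, sorted in decreasing order (illustrative values)
eigvals = np.array([9.1, 4.3, 0.6, 0.4, 0.3, 0.2])

plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("eigenvalue index i")
plt.ylabel("eigenvalue magnitude lambda_i")
plt.title("Scree plot: keep the components before the knee")
plt.show()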
23
Principal Component Analysis (PCA)
Another possibility is percentage of explained variance
• Remember that the eigenvalues λ_i = σ_i^2 are a measure of variance
• The ratio
    r_k = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_n)
  measures the % of the total variance contained in the top k eigendirections
• This is a measure of the fraction of data variability along the associated eigenvectors
[Figure: the axis-aligned ellipse in (z_1, z_2) and the rotated ellipse in (y_1, y_2) with axes φ_1, φ_2, related by y = Φ z]
24
Principal Component Analysis (PCA)
A natural measure is to pick the eigenvectors that explain p % of the data variability
• This can be done by plotting the ratio r_k as a function of k
• E.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
    r_k = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_n)
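A minimal sketch (my own illustration, with made-up eigenvalues chosen so that three components cover 70%) of choosing k from the ratio r_k:

import numpy as np

eigvals = np.array([5.0, 3.0, 2.8, 1.5, 1.0, 0.7, 0.5, 0.5])   # sorted decreasing, illustrative

r = np.cumsum(eigvals) / np.sum(eigvals)   # r_k for k = 1, ..., n
p = 0.70
k = int(np.searchsorted(r, p) + 1)         # smallest k with r_k >= p

print(np.round(r, 3))   # cumulative fraction of variance
print(k)                # 3 eigenvectors cover 70% of the variance here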
25
Principal Component Analysis (PCA)
An alternative way to compute the principal components is to use the singular value decomposition (SVD)
SVD (full column-rank case):
• Any real n x m matrix of rank m can be decomposed as
• M is an n x m column-orthonormal matrix of “left singular vectors” (the columns of M). M is necessarily “tall” or “square”
• P is an m x m diagonal matrix of strictly positive “singular values”
• N is an m x m orthogonal matrix of “right singular vectors” (the columns of N). Note that both rows and columns of N are orthonormal.
    A = M P N^T,   with   M^T M = I   and   N^T N = N N^T = I
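A small check (my own sketch, using a random matrix) of this decomposition with NumPy's thin SVD, where M, P, N correspond to U, diag(s), V:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))             # n x m, full column rank with probability 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
M, P, N = U, np.diag(s), Vt.T           # A = M P N^T

print(np.allclose(A, M @ P @ N.T))      # True
print(np.allclose(M.T @ M, np.eye(3)))  # True: columns of M are orthonormal
print(np.allclose(N.T @ N, np.eye(3)))  # True: N is orthogonal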
26
PCA by SVD
To relate this to PCA, we consider the "data matrix"
    X = [ x_1 ... x_n ]
where each data vector x_i is a column and is d dimensional. Note that X is of dimension d x n.
The sample mean is
    μ = (1/n) Σ_i x_i = (1/n) X 1
where 1 = (1, ..., 1)^T is the n x 1 vector of ones.
27
PCA by SVD
We “center” the data by subtracting the mean from each column of X
This yields the “centered data matrix”
    X_c = [ x_1 - μ ... x_n - μ ] = X - μ 1^T = X (I - (1/n) 1 1^T)
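A short sketch (my own illustration) of centering a d x n data matrix with the centering matrix I - (1/n) 1 1^T:

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(loc=2.0, size=(d, n))       # data matrix, one example per column

ones = np.ones((n, 1))
Xc = X @ (np.eye(n) - ones @ ones.T / n)   # X_c = X (I - (1/n) 1 1^T)

# equivalent to subtracting the sample mean from each column
print(np.allclose(Xc, X - X.mean(axis=1, keepdims=True)))   # True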
28
PCA by SVD
The sample covariance is
    Σ = (1/n) Σ_i (x_i - μ)(x_i - μ)^T = (1/n) Σ_i x_i^c (x_i^c)^T
where x_i^c is the i-th column of X_c
This can be written as
    Σ = (1/n) X_c X_c^T
29
PCA by SVD
The matrix X_c^T (whose rows are the centered data points (x_i^c)^T) is real n x d. Assuming it has rank d, it has an SVD
    X_c^T = M P N^T,   with   M^T M = I   and   N^T N = I
Therefore
    Σ = (1/n) X_c X_c^T = (1/n) N P M^T M P N^T = (1/n) N P^2 N^T
30
PCA by SVD
    Σ = (1/n) N P^2 N^T
• The fact that N is d x d and orthonormal, and P^2 is diagonal, shows that this is just the eigenvalue decomposition of Σ
It follows that
• The eigenvectors of Σ are the columns of N
• The eigenvalues of Σ are
    λ_i = p_i^2 / n
  where the p_i are the singular values (the diagonal entries of P)
This gives an alternative algorithm for PCA
31
PCA by SVD
Computation of PCA by SVD
Given a data matrix X with one example per column
• 1) Create the centered data matrix
    X_c^T = (I - (1/n) 1 1^T) X^T
• 2) Compute the SVD
    X_c^T = M P N^T
• 3) The principal components are the columns of N. The eigenvalues are
    λ_i = p_i^2 / n
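A compact NumPy sketch of this procedure (my own illustration of the three steps, not code from the lecture):

import numpy as np

def pca_by_svd(X):
    # X: d x n data matrix, one example per column
    d, n = X.shape
    Xc_T = X.T - X.mean(axis=1)                            # 1) centered data matrix, transposed (n x d)
    M, p, Nt = np.linalg.svd(Xc_T, full_matrices=False)    # 2) SVD: Xc^T = M P N^T
    return Nt.T, p**2 / n                                  # 3) components = columns of N, eigenvalues p_i^2 / n

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))
N, lam = pca_by_svd(X)

# sanity check: same eigenvalues as the eigendecomposition of the sample covariance
Xc = X - X.mean(axis=1, keepdims=True)
Sigma = Xc @ Xc.T / X.shape[1]
w, V = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
print(np.allclose(np.sort(lam), w))     # True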
32
PCA – Example
Principal components are usually quite informative about the structure of the data
Example
• the principal components for the space of images of faces
• the figure only shows the first 16 eigenvectors
• note lighting, structure, etc
33
PCA - Example
PCA has been applied to virtually all learning problems
E.g. eigenshapes for face morphing
morphed faces
34
PCA - Example
Sound
[Figure: the average of the sound images and the eigenobjects corresponding to the three highest eigenvalues]
35
PCA – Example
Turbulence in burning
flames
eigenflames
36
PCA - Example
Video: [Figure: eigenrings and reconstruction]
37
PCA – Example
Text: latent semantic analysis (or indexing)
• represent each document by a word histogram
• perform SVD on the document-word matrix
• interpret the principal components as the directions of semantic concepts
[Figure: the (documents x terms) matrix factors as (documents x concepts) x (concepts x concepts) x (concepts x terms)]
38
PCA in Latent Semantic Analysis (LSA)
Applications:
• document classification, information retrieval
Goal: solve two fundamental problems in language
• Synonymy: different writers use different words to describe the same idea
• Polysemy: the same word can have multiple meanings
Reasons:
• the original term-document matrix is too large for the computing resources
• the original term-document matrix is noisy: for instance, anecdotal instances of terms should be eliminated
• the original term-document matrix is overly sparse relative to the "true" term-document matrix: it lists only the words actually in each document, whereas we might be interested in all words related to each document, a much larger set due to synonymy
39
PCA and LSA
After PCA some dimensions get "merged":
• {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
This mitigates synonymy,
• since it merges the dimensions associated with terms that have similar meanings,
and mitigates polysemy,
• since components of polysemous words that point in the "right" direction are added to the components of words that share this sense
• conversely, components that point in other directions tend to either simply cancel out or, at worst, be smaller than the components in the directions corresponding to the intended sense
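A toy sketch (my own illustration, with a made-up term-document count matrix) of LSA as a truncated SVD:

import numpy as np

# made-up term-document counts: rows = terms, columns = documents
terms = ["car", "truck", "flower"]
A = np.array([[2.0, 3.0, 0.0, 1.0],
              [1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                           # keep the top-k "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k reconstruction of the matrix

# terms that co-occur ("car", "truck") load on a shared concept direction
for term, weights in zip(terms, U[:, :k]):
    print(term, np.round(weights, 3))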
40
END