Dimensionality Reduction & PCA
Ken Kreutz-Delgado
Nuno Vasconcelos
ECE 175B – Spring 2011 – UCSD
2
The Curse of Dimensionality
The curse of dimensionality means that
• for a given classifier, the number of examples required to maintain classification accuracy increases exponentially with the dimension of the feature space
In higher dimensions the classifier has more parameters
• and therefore has higher complexity, which makes it harder to learn
3
Dimensionality Reduction
How can we avoid the curse of dimensionality?
Avoid unnecessary dimensions in feature space!
“Unnecessary” can be measured in two ways:
1. Features are non-discriminative
2. Features are not independent
“Non-discriminative” means that they do not separate classes well:
[Figure: two scatter plots, one where the feature separates the classes (discriminative) and one where it does not (non-discriminative)]
4
Dimensionality Reduction
In addition, dependent features are not always all needed - fewer can be enough!
E.g. consider data-mining company studying consumer credit card ratings with a feature vector:
X = {salary, mortgage, car loan, # of kids, profession, ...}
Note that the first three features tend to be highly correlated:
• “The more you make, the higher the mortgage, the more expensive the car you drive”
• From one of these variables one can often predict the others very well
Including features 2 and 3 might not greatly increase the discrimination, but it will definitely increase the dimension, which can lead to poor density estimates.
5
Q: how do we detect the presence of feature correlations?
A: Detect if the data “mostly lives” in a lower dimensional subspace (up to some amount of noise).
In this example we have a 1-D hyperplane in 2-D
• If we can find this hyperplane, we can project the data onto it and thereby remove a dimension of the feature space without introducing significant error
Dimensionality Reduction
[Figure: scatter of (salary, car loan) data lying close to a line; projection onto the 1-D subspace y = a x yields a new 1-D feature y]
6
Principal Component Analysis (PCA)
Basic idea:
• If the data “lives” in a lower dimensional subspace, it is going to look flat when viewed from the full space, e.g.
This means that
• If we fit a Gaussian to the data the iso-probability contours (aka concentration ellipsoids) are going to be “pancake-like” ellipsoids. This is because the location of the “probability mass” tells you where the data “lives.”
[Figures: data on a 1-D subspace in 2-D; data on a 2-D subspace in 3-D]
7
Principal Component Analysis (PCA)
How do we find these ellipsoids?
Recall that when we talked about metrics we said that
• the Mahalanobis distance
    d(x, y) = (x - y)^T Σ^{-1} (x - y)
  determines the "natural" units for a Gaussian pdf, because it is "adapted" to the covariance of the data
We also know that
• what is special about it is that it uses Σ^{-1}
Hence, the subspace information must be in the covariance Σ
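As a small illustration (not from the slides), a NumPy sketch of the Mahalanobis distance for a given covariance Σ; the matrix and points below are made-up examples:

import numpy as np

def mahalanobis(x, y, Sigma):
    # d(x, y) = (x - y)^T Sigma^{-1} (x - y)
    diff = x - y
    return float(diff @ np.linalg.solve(Sigma, diff))

# toy 2-D covariance with strongly correlated features
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
x = np.array([2.0, 1.0])
y = np.array([0.0, 0.0])
print(mahalanobis(x, y, Sigma))   # distance in the "natural" units of the Gaussian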
8
Principal Component Analysis (PCA)
All relevant information is stored in the eigenvalue decomposition of the covariance matrix
So, let’s start with a brief review of eigenvectors
• Recall: an n x n matrix represents an operator that maps a vector from R^n into a vector in R^n
E.g. the equation y = Ax represents the sending of x in R^n to y in R^n:
    [ y_1 ]   [ a_11 ... a_1n ] [ x_1 ]
    [ ... ] = [ ...       ... ] [ ... ]
    [ y_n ]   [ a_n1 ... a_nn ] [ x_n ]
[Figure: A maps x, written in the canonical basis e_1, ..., e_n, to y in the same basis]
9
Eigenvectors
What is amazing is that there are "special" ("eigen" or "proper") vectors which are simply re-scaled by the transformation
These are the eigenvectors of the matrix A
• They are the solutions φ_i to the equation
    A φ_i = λ_i φ_i
  where the λ_i are the matrix eigenvalues
[Figure: under y = Ax, a generic vector is rotated and scaled, while an eigenvector is only re-scaled]
10
Eigenvector Decomposition
Note that if there are n eigenvectors, these equations can be written all at once as
    A [ φ_1 ... φ_n ] = [ λ_1 φ_1 ... λ_n φ_n ]
or,
    A [ φ_1 ... φ_n ] = [ φ_1 ... φ_n ] diag(λ_1, ..., λ_n)
i.e.
    A Φ = Φ Λ,   with Φ = [ φ_1 ... φ_n ] and Λ = diag(λ_1, ..., λ_n)
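As an illustrative check (my own sketch, with a made-up symmetric matrix), the relation A Φ = Φ Λ can be verified numerically:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # a symmetric example matrix

lam, Phi = np.linalg.eigh(A)         # eigenvalues lam, eigenvectors as columns of Phi
Lambda = np.diag(lam)

print(np.allclose(A @ Phi, Phi @ Lambda))   # True: A Phi = Phi Lambda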
11
Symmetric Eigendecomposition
A covariance matrix is symmetric, so there are n orthogonal eigenvectors (which can be taken to have unit norm).
This implies that Φ = [ φ_1 ... φ_n ] is an orthogonal matrix,
    Φ^T Φ = Φ Φ^T = I,
so that
    A = Φ Λ Φ^T
This is called the eigenvector decomposition (or eigen-decomposition) of the symmetric matrix A
It gives an alternative geometric interpretation to the matrix operation y = Ax via
    y = A x = Φ Λ Φ^T x
12
Symmetric Eigendecomposition
This can be interpreted as a sequence of three steps
• 1) Apply Φ^T to x
  - We know that this is (nearly) a rotation. It is just the inverse rotation: x' = Φ^T x
• 2) Apply Λ
  - This is just coordinate-wise scaling: x'' = Λ x', i.e. x''_i = λ_i x'_i
• 3) Apply Φ
  - This is (nearly) a rotation back to the initial space: y = Φ x''
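A small numerical sketch (my own illustration, reusing the made-up symmetric A above) of the three-step view y = Φ Λ Φ^T x:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, Phi = np.linalg.eigh(A)

x = np.array([1.0, -2.0])
x_prime  = Phi.T @ x           # 1) inverse rotation into the eigenvector frame
x_dprime = lam * x_prime       # 2) coordinate-wise scaling by the eigenvalues
y        = Phi @ x_dprime      # 3) rotation back to the original frame

print(np.allclose(y, A @ x))   # True: same result as applying A directly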
13
Rotation Matrices
Remember that rotation matrices are best understood by considering how the matrix operates on the vectors of the canonical basis
• Note that Φ sends e_1 to φ_1:
    Φ e_1 = [ φ_1 ... φ_n ] [1, 0, ..., 0]^T = φ_1
• Since Φ^T is the inverse rotation, it sends φ_1 to e_1
Hence, the sequence of operations is
• 1) Send the φ_i to the e_i (the canonical basis in the φ_i-frame)
• 2) Rescale the e_i
• 3) Rotate the re-scaled e_i back to the original coordinate frame
[Figure: a 2-D rotation sends e_1 to the vector (cos θ, sin θ)]
14
Symmetric Eigendecomposition
Graphically:
[Figure: the unit circle in the (e_1, e_2) frame is (1) rotated by Φ^T, (2) stretched along the coordinate axes, and (3) rotated back by Φ, giving an ellipse]
This means that:
1) The φ_i are the axes of the ellipse
2) The width of the ellipse depends on the amount of "stretching"
15
Symmetric Eigendecomposition
Note that the stretching is done in step (2): for x' = e_i, the length of x'' = Λ x' is λ_i
Hence, the overall picture is
• The axes of the ellipse are given by the φ_i
• These have length λ_i
This insight can be used to find low-dimensional data subspaces
[Figure: the resulting ellipse has its axes along the φ_i; the axis along φ_1 has length λ_1]
16
Multivariate Gaussian Review
The equiprobability density contours of a d-dimensional Gaussian are the points x such that
    (x - μ)^T Σ^{-1} (x - μ) = c   (for some constant c)
Let's consider the change of variable z = x - μ, which only shifts the origin to the point μ. The equation
    z^T Σ^{-1} z = c
is the equation of an ellipse (hyper-ellipse).
This is easy to see when Σ is diagonal, Σ = diag(σ_1^2, ..., σ_d^2):
    z_1^2/σ_1^2 + ... + z_d^2/σ_d^2 = c
17
Gaussian Review
This is the equation of an ellipse with principal lengths σ_i
• E.g. when d = 2 (and c = 1), the equation
    z_1^2/σ_1^2 + z_2^2/σ_2^2 = 1
  describes the ellipse
Guided by this insight, introduce the transformation y = Φ z
[Figure: an axis-aligned ellipse in the (z_1, z_2) plane with semi-axes σ_1 and σ_2]
18
Gaussian Review
Introduce the transformation y = Φ z, assuming that z has diagonal covariance Σ_z = diag(σ_1^2, ..., σ_d^2)
Then y has covariance
    Σ_y = Φ Σ_z Φ^T
Since Φ is orthogonal, it is (nearly) a rotation, so we obtain a rotated ellipse with principal components φ_1 and φ_2, which are the columns of Φ
Note that Σ_y = Φ Σ_z Φ^T is the eigendecomposition of Σ_y
[Figure: the axis-aligned ellipse with semi-axes σ_1, σ_2 in the (z_1, z_2) plane is mapped by y = Φ z to a rotated ellipse in the (y_1, y_2) plane with axes along φ_1 and φ_2]
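A quick numerical sketch (my own illustration, with an assumed rotation angle and made-up variances) of how the map y = Φ z shapes the covariance:

import numpy as np

rng = np.random.default_rng(0)

# z: zero-mean samples with diagonal covariance diag(3^2, 0.5^2)
sigma = np.array([3.0, 0.5])
z = rng.normal(size=(100000, 2)) * sigma

# an orthogonal (rotation) matrix Phi
theta = np.pi / 6
Phi = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

y = z @ Phi.T                       # y = Phi z, applied to each sample (row)
Sigma_y = np.cov(y, rowvar=False)   # sample covariance of y

print(np.round(Sigma_y, 2))                          # estimated covariance
print(np.round(Phi @ np.diag(sigma**2) @ Phi.T, 2))  # Phi Sigma_z Phi^T: should be close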
19
Principal Component Analysis (PCA)
If y is Gaussian with covariance Σ, the equiprobability density contours are the ellipses whose
• "principal components" φ_i are the eigenvectors of Σ
• "principal values" σ_i are the square roots of the eigenvalues λ_i of Σ
By computing the eigenvalues we can find out if the data is "flat":
    λ_1 >> λ_2 : flat        λ_1 ≈ λ_2 : not flat
[Figures: a strongly elongated ellipse (λ_1 >> λ_2, flat data) and a nearly circular ellipse (λ_1 ≈ λ_2, not flat)]
20
Learning the Principal Components
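A minimal NumPy sketch (my own illustration, not the lecture's code) of learning principal components by eigendecomposition of the sample covariance, the procedure the later PCA-by-SVD slides treat as the baseline:

import numpy as np

def learn_principal_components(X, k):
    # X: d x n data matrix, one example per column; keep the top-k components
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                # center the data
    Sigma = Xc @ Xc.T / X.shape[1]             # sample covariance (1/n convention)
    lam, Phi = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]              # re-order: largest eigenvalues first
    return Phi[:, order[:k]], lam[order[:k]], mu

# project the data onto the top-k principal components: y = Phi_k^T (x - mu)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
Phi_k, lam_k, mu = learn_principal_components(X, k=2)
Y = Phi_k.T @ (X - mu)
print(Y.shape)   # (2, 200)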
21
Principal Component Analysis (PCA)
22
Principal Component Analysis (PCA)
How to determine the number k of eigenvectors to keep?
• One possibility is to plot eigenvalue magnitudes
- This yields a “scree plot” or “L-curve”
• Usually there is a fast decrease in the eigenvalue magnitude followed by a flat area
• One good choice is the “knee” of this curve
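A minimal sketch (my own illustration, with made-up eigenvalue magnitudes) of such a scree plot:

import numpy as np
import matplotlib.pyplot as plt

# eigenvalues of the sample covariance, sorted in decreasing order (illustrative values)
eigvals = np.array([9.1, 4.3, 0.6, 0.4, 0.3, 0.2])

plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("eigenvalue index i")
plt.ylabel("eigenvalue magnitude lambda_i")
plt.title("Scree plot: keep the components before the knee")
plt.show()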
23
Principal Component Analysis (PCA)
Another possibility is percentage of explained variance
• Remember that the eigenvalues λ_i = σ_i^2 are a measure of variance
• The ratio
    r_k = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_n)
  measures the % of the total variance contained in the top k eigendirections
• This is a measure of the fraction of data variability along the associated eigenvectors
[Figure: the axis-aligned ellipse in (z_1, z_2) and the rotated ellipse in (y_1, y_2) with axes φ_1, φ_2, related by y = Φ z]
24
Principal Component Analysis (PCA)
A natural measure is to pick the eigenvectors that explain p % of the data variability
• This can be done by plotting the ratio r_k as a function of k
• E.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
    r_k = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_n)
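A minimal sketch (my own illustration, with made-up eigenvalues chosen so that three components cover 70%) of choosing k from the ratio r_k:

import numpy as np

eigvals = np.array([5.0, 3.0, 2.8, 1.5, 1.0, 0.7, 0.5, 0.5])   # sorted decreasing, illustrative

r = np.cumsum(eigvals) / np.sum(eigvals)   # r_k for k = 1, ..., n
p = 0.70
k = int(np.searchsorted(r, p) + 1)         # smallest k with r_k >= p

print(np.round(r, 3))   # cumulative fraction of variance
print(k)                # 3 eigenvectors cover 70% of the variance here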
25
Principal Component Analysis (PCA)
An alternative way to compute the principal components is to use the singular value decomposition (SVD)
SVD (full column-rank case):
• Any real n x m matrix of rank m can be decomposed as
• M is an n x m column-orthonormal matrix of “left singular vectors” (the columns of M). M is necessarily “tall” or “square”
• P is an m x m diagonal matrix of strictly positive “singular values”
• N is an m x m orthogonal matrix of “right singular vectors” (the columns of N). Note that both rows and columns of N are orthonormal.
    A = M P N^T,   with   M^T M = I   and   N^T N = N N^T = I
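A small check (my own sketch, using a random matrix) of this decomposition with NumPy's thin SVD, where M, P, N correspond to U, diag(s), V:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))             # n x m, full column rank with probability 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
M, P, N = U, np.diag(s), Vt.T           # A = M P N^T

print(np.allclose(A, M @ P @ N.T))      # True
print(np.allclose(M.T @ M, np.eye(3)))  # True: columns of M are orthonormal
print(np.allclose(N.T @ N, np.eye(3)))  # True: N is orthogonal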
26
PCA by SVD
To relate this to PCA, we consider the "data matrix"
    X = [ x_1 ... x_n ]
where each data vector x_i is a column and is d dimensional. Note that X is of dimension d x n.
The sample mean is
    μ = (1/n) Σ_i x_i = (1/n) X 1
where 1 = (1, ..., 1)^T is the n x 1 vector of ones.
27
PCA by SVD
We “center” the data by subtracting the mean from each column of X
This yields the “centered data matrix”
    X_c = [ x_1 - μ ... x_n - μ ] = X - μ 1^T = X (I - (1/n) 1 1^T)
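A short sketch (my own illustration) of centering a d x n data matrix with the centering matrix I - (1/n) 1 1^T:

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(loc=2.0, size=(d, n))       # data matrix, one example per column

ones = np.ones((n, 1))
Xc = X @ (np.eye(n) - ones @ ones.T / n)   # X_c = X (I - (1/n) 1 1^T)

# equivalent to subtracting the sample mean from each column
print(np.allclose(Xc, X - X.mean(axis=1, keepdims=True)))   # True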
28
PCA by SVD
The sample covariance is
    Σ = (1/n) Σ_i (x_i - μ)(x_i - μ)^T = (1/n) Σ_i x_i^c (x_i^c)^T
where x_i^c is the i-th column of X_c
This can be written as
    Σ = (1/n) X_c X_c^T
29
PCA by SVD
The matrix X_c^T (whose rows are the centered data points (x_i^c)^T) is real n x d. Assuming it has rank d, it has an SVD
    X_c^T = M P N^T,   with   M^T M = I   and   N^T N = I
Therefore
    Σ = (1/n) X_c X_c^T = (1/n) N P M^T M P N^T = (1/n) N P^2 N^T
30
PCA by SVD
    Σ = (1/n) N P^2 N^T
• The fact that N is d x d and orthonormal, and P^2 is diagonal, shows that this is just the eigenvalue decomposition of Σ
It follows that
• The eigenvectors of Σ are the columns of N
• The eigenvalues of Σ are
    λ_i = p_i^2 / n
  where the p_i are the singular values (the diagonal entries of P)
This gives an alternative algorithm for PCA
31
PCA by SVD
Computation of PCA by SVD
Given a data matrix X with one example per column
• 1) Create the centered data matrix
    X_c^T = (I - (1/n) 1 1^T) X^T
• 2) Compute the SVD
    X_c^T = M P N^T
• 3) The principal components are the columns of N. The eigenvalues are
    λ_i = p_i^2 / n
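A compact NumPy sketch of this procedure (my own illustration of the three steps, not code from the lecture):

import numpy as np

def pca_by_svd(X):
    # X: d x n data matrix, one example per column
    d, n = X.shape
    Xc_T = X.T - X.mean(axis=1)                            # 1) centered data matrix, transposed (n x d)
    M, p, Nt = np.linalg.svd(Xc_T, full_matrices=False)    # 2) SVD: Xc^T = M P N^T
    return Nt.T, p**2 / n                                  # 3) components = columns of N, eigenvalues p_i^2 / n

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))
N, lam = pca_by_svd(X)

# sanity check: same eigenvalues as the eigendecomposition of the sample covariance
Xc = X - X.mean(axis=1, keepdims=True)
Sigma = Xc @ Xc.T / X.shape[1]
w, V = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
print(np.allclose(np.sort(lam), w))     # True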
32
PCA – Example
Principal components are usually quite informative about the structure of the data
Example
• the principal components for the space of images of faces
• the figure only shows the first 16 eigenvectors
• note lighting, structure, etc
33
PCA - Example
PCA has been applied to virtually all learning problems
E.g. eigenshapes for face morphing
morphed faces
34
PCA - Example
Sound
[Figure: the average of the sound images and the eigenobjects corresponding to the three highest eigenvalues]
35
PCA – Example
Turbulence in burning
flames
eigenflames
36
PCA - Example
Video: [Figure: eigenrings and reconstruction]
37
PCA – Example
Text: latent semantic analysis (or indexing)
• represent each document by a word histogram
• perform SVD on the document-word matrix
• interpret the principal components as the directions of semantic concepts
[Figure: the (documents x terms) matrix factors as (documents x concepts) x (concepts x concepts) x (concepts x terms)]
38
PCA in Latent Semantic Analysis (LSA)
Applications:
• document classification, information retrieval
Goal: solve two fundamental problems in language
• Synonymy: different writers use different words to describe the same idea
• Polysemy: the same word can have multiple meanings
Reasons:
• the original term-document matrix is too large for the computing resources
• the original term-document matrix is noisy: for instance, anecdotal instances of terms should be eliminated
• the original term-document matrix is overly sparse relative to the "true" term-document matrix: it lists only the words actually in each document, whereas we might be interested in all words related to each document, a much larger set due to synonymy
39
PCA and LSA
After PCA some dimensions get "merged":
• {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
This mitigates synonymy,
• since it merges the dimensions associated with terms that have similar meanings,
and mitigates polysemy,
• since components of polysemous words that point in the "right" direction are added to the components of words that share this sense
• conversely, components that point in other directions tend to either simply cancel out or, at worst, be smaller than the components in the directions corresponding to the intended sense
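A toy sketch (my own illustration, with a made-up term-document count matrix) of LSA as a truncated SVD:

import numpy as np

# made-up term-document counts: rows = terms, columns = documents
terms = ["car", "truck", "flower"]
A = np.array([[2.0, 3.0, 0.0, 1.0],
              [1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                           # keep the top-k "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k reconstruction of the matrix

# terms that co-occur ("car", "truck") load on a shared concept direction
for term, weights in zip(terms, U[:, :k]):
    print(term, np.round(weights, 3))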
40
END