
Page 1: Dimensionality Reduction & PCA

Dimensionality Reduction & PCA

Ken Kreutz-Delgado

Nuno Vasconcelos

ECE 175B – Spring 2011 – UCSD

Page 2: Dimensionality Reduction & PCA

The Curse of Dimensionality

The curse of dimensionality means that

• for a given classifier, the number of examples required to maintain classification accuracy increases exponentially with the dimension of the feature space

In higher dimensions the classifier has more parameters

• and therefore a higher complexity, which makes it harder to learn
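To make the exponential growth concrete, here is a minimal numeric sketch (not from the slides), assuming a histogram-style density estimate that needs at least one example per cell of a grid with 10 bins per axis:

```python
# Number of cells needed to cover [0,1]^d with 10 bins per axis.
# A density estimate needing at least one example per cell requires ~10**d samples.
bins_per_axis = 10
for d in (1, 2, 3, 5, 10):
    print(f"d={d:2d}: {bins_per_axis**d:,} cells")
```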

Page 3: Dimensionality Reduction & PCA

Dimensionality Reduction

How can we avoid the curse of dimensionality?

Avoid unnecessary dimensions in feature space!

“Unnecessary” can be measured in two ways:

1. Features are non-discriminative

2. Features are not independent

“Non-discriminative” means that they do not separate classes well:

[Figure: two scatter plots, one with well-separated classes (discriminative features) and one with overlapping classes (non-discriminative features).]

Page 4: Dimensionality Reduction & PCA

Dimensionality Reduction

In addition, dependent features are not always all needed - fewer can be enough!

E.g. consider a data-mining company studying consumer credit card ratings, with a feature vector:

X = {salary, mortgage, car loan, # of kids, profession, ...}

Note that the first three features tend to be highly correlated:

• “The more you make, the higher the mortgage, the more expensive the car you drive”

• From one of these variables one can often predict the others very well

Including features 2 and 3 might not greatly increase the discrimination, but it will definitely increase the dimension, which can lead to poor density estimates.

Page 5: Dimensionality Reduction & PCA

Dimensionality Reduction

Q: How do we detect the presence of feature correlations?

A: Detect whether the data “mostly lives” in a lower-dimensional subspace (up to some amount of noise).

In the example below we have a 1-D hyperplane (a line) in 2-D.

• If we can find this hyperplane, we can project the data onto it and thereby remove one dimension of the feature space without introducing significant error.

[Figure: scatter of (salary, car loan) data concentrated around a line; projecting onto the 1-D subspace y = a x yields a new 1-D feature y.]
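A minimal sketch of this projection in code, using synthetic correlated data (the salary/car-loan numbers and the use of the leading covariance eigenvector are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
salary = rng.normal(50, 10, size=200)                  # synthetic "salary" feature
car_loan = 0.5 * salary + rng.normal(0, 2, size=200)   # strongly correlated with salary

X = np.column_stack([salary, car_loan])                # n x 2 data matrix
Xc = X - X.mean(axis=0)                                # center the data

# Direction of maximum variance = leading eigenvector of the sample covariance
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
a = eigvecs[:, -1]                                     # leading principal direction

y = Xc @ a                                             # new 1-D feature (projection onto the subspace)
print("variance captured:", eigvals[-1] / eigvals.sum())
```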

Page 6: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

Basic idea:

• If the data “lives” in a lower dimensional subspace, it is going to look flat when viewed from the full space, e.g.

This means that

• If we fit a Gaussian to the data the iso-probability contours (aka concentration ellipsoids) are going to be “pancake-like” ellipsoids. This is because the location of the “probability mass” tells you where the data “lives.”

[Figure: left, a 1-D subspace in 2-D; right, a 2-D subspace in 3-D.]

Page 7: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

How do we find these ellipsoids?

Recall that when we talked about metrics we said that

• the Mahalanobis distance

$$d(x, y) = (x - y)^T \Sigma^{-1} (x - y)$$

determines the “natural” units for a Gaussian pdf, because it is “adapted” to the covariance of the data.

We also know that

• what is special about it is that it uses $\Sigma^{-1}$

Hence, the subspace information must be in the covariance $\Sigma$.
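A small sketch of the Mahalanobis distance as defined above (the example covariance is illustrative):

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """Mahalanobis distance d(x, y) = (x - y)^T Sigma^{-1} (x - y)."""
    diff = x - y
    return diff @ np.linalg.solve(Sigma, diff)   # avoids forming Sigma^{-1} explicitly

Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
x = np.array([1.0, 0.0])
y = np.array([0.0, 0.0])
print(mahalanobis(x, y, Sigma))
```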

Page 8: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

All relevant information is stored in the eigenvalue decomposition of the covariance matrix.

So, let's start with a brief review of eigenvectors.

• Recall: an n x n matrix represents an operator that maps a vector from $R^n$ into a vector in $R^n$.

E.g. the equation y = Ax represents the sending of x in $R^n$ to y in $R^n$:

$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$

[Figure: A maps the vector x, expressed in the basis $e_1, \ldots, e_n$, to the vector y in the same space.]

Page 9: Dimensionality Reduction & PCA

Eigenvectors

What is amazing is that there are “special” (“eigen” or “proper”) vectors which are simply re-scaled by the transformation.

These are the eigenvectors of the matrix A.

• They are the solutions $\phi_i$ of the equation

$$A\phi_i = \lambda_i \phi_i$$

where the $\lambda_i$ are the matrix eigenvalues.

[Figure: an eigenvector of A is mapped to a re-scaled copy of itself.]
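A quick check of this definition with NumPy (the matrix A is illustrative):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs are the eigenvectors phi_i

for lam, phi in zip(eigvals, eigvecs.T):
    # A phi = lambda phi: the transformation only re-scales these directions
    print(np.allclose(A @ phi, lam * phi))   # True
```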

Page 10: Dimensionality Reduction & PCA

Eigenvector Decomposition

Note that if there are n eigenvectors, these equations can be written all at once as

$$A \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ \lambda_1 \phi_1 & \cdots & \lambda_n \phi_n \\ | & & | \end{bmatrix}$$

or

$$A \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$

i.e., with $\Phi = [\phi_1 \cdots \phi_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$,

$$A\Phi = \Phi\Lambda$$

Page 11: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

A covariance matrix is symmetric, so there are n orthogonal eigenvectors (which can be taken to have unit norm).

This implies that $\Phi = [\phi_1 \cdots \phi_n]$ is an orthogonal matrix,

$$\Phi^T \Phi = \Phi \Phi^T = I$$

so that

$$A = \Phi \Lambda \Phi^T$$

This is called the eigenvector decomposition (or eigen-decomposition) of the symmetric matrix A.

It gives an alternative geometric interpretation to the matrix operation y = Ax via

$$y = Ax = \Phi \Lambda \Phi^T x$$
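A minimal sketch verifying these identities numerically (the symmetric matrix is illustrative):

```python
import numpy as np

# A symmetric (e.g. covariance) matrix
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

lam, Phi = np.linalg.eigh(A)      # eigenvalues (ascending) and orthonormal eigenvectors
Lambda = np.diag(lam)

print(np.allclose(Phi.T @ Phi, np.eye(2)))        # Phi^T Phi = I
print(np.allclose(Phi @ Lambda @ Phi.T, A))       # A = Phi Lambda Phi^T

x = np.array([1.0, -2.0])
print(np.allclose(A @ x, Phi @ (Lambda @ (Phi.T @ x))))   # y = Ax computed in three steps
```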

Page 12: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

This can be interpreted as a sequence of three steps

• 1) Apply $\Phi^T$ to x:

$$x' = \Phi^T x$$

We know that this is (nearly) a rotation. It is just the inverse rotation.

• 2) Apply $\Lambda$:

$$x'' = \Lambda x' = \begin{bmatrix} \lambda_1 x'_1 \\ \vdots \\ \lambda_n x'_n \end{bmatrix}$$

This is just coordinate-wise scaling.

• 3) Apply $\Phi$:

$$y = \Phi x''$$

This is (nearly) a rotation back to the initial space.

Page 13: Dimensionality Reduction & PCA

Rotation Matrices

Remember that rotation matrices are best understood by considering how the matrix operates on the vectors of the canonical basis.

• Note that $\Phi$ sends $e_1$ to $\phi_1$:

$$\Phi e_1 = \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \phi_1$$

• Since $\Phi^T$ is the inverse rotation, it sends $\phi_1$ to $e_1$.

Hence, the sequence of operations is

• 1) Send $\phi_i$ to $e_i$ (the canonical basis in the $\phi_i$-frame)

• 2) Rescale the $e_i$

• 3) Rotate the re-scaled $e_i$ back to the original coordinate frame

[Figure: a 2-D rotation taking $e_1$ to the vector $(\cos\theta, \sin\theta)$.]

Page 14: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

Graphically, this means that:

1) the $\phi_i$ are the axes of the ellipse

2) the width of the ellipse depends on the amount of “stretching”

[Figure: the unit circle is rotated by $\Phi^T$ (step 1), stretched coordinate-wise (step 2), and rotated back by $\Phi$ (step 3), producing an ellipse.]

Page 15: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

Note that the stretching is done in step (2): for $x' = e_i$, the length of $x'' = \Lambda x'$ is $\lambda_i$.

Hence, the overall picture is:

• The axes are given by the $\phi_i$

• These have length $\lambda_i$

This insight can be used to find low-dimensional data subspaces.

[Figure: step (2) stretches $e_1, e_2$ by $\lambda_1, \lambda_2$; after the final rotation, the ellipse has axes $\phi_i$ of length $\lambda_i$.]

Page 16: Dimensionality Reduction & PCA

Multivariate Gaussian Review

The equiprobability density contours of a d-dimensional Gaussian are the points x such that

$$(x - \mu)^T \Sigma^{-1} (x - \mu) = c$$

for some constant c. Let's consider the change of variable $z = x - \mu$, which only shifts the origin to the point $\mu$. The resulting equation

$$z^T \Sigma^{-1} z = c$$

is the equation of an ellipse (hyper-ellipse). This is easy to see when $\Sigma$ is diagonal, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$:

$$\sum_{i=1}^{d} \frac{z_i^2}{\sigma_i^2} = c$$

Page 17: Dimensionality Reduction & PCA

Gaussian Review

This is the equation of an ellipse with principal lengths $\sigma_i$.

• E.g. when d = 2, the equation

$$\frac{z_1^2}{\sigma_1^2} + \frac{z_2^2}{\sigma_2^2} = 1$$

describes the ellipse below.

Guided by this insight, introduce the transformation $y = \Phi z$.

[Figure: an axis-aligned ellipse in the $(z_1, z_2)$ plane with semi-axes $\sigma_1$ and $\sigma_2$.]

Page 18: Dimensionality Reduction & PCA

Gaussian Review

Introduce the transformation $y = \Phi z$, assuming that z has diagonal covariance $\Sigma_z = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$.

Then y has covariance

$$\Sigma_y = \Phi \Sigma_z \Phi^T$$

Since $\Phi$ is orthogonal, it is (nearly) a rotation, so we obtain a rotated ellipse with principal components $\phi_1$ and $\phi_2$, which are the columns of $\Phi$.

Note that $\Sigma_y = \Phi \Sigma_z \Phi^T$ is the eigendecomposition of $\Sigma_y$.

[Figure: the axis-aligned ellipse with semi-axes $\sigma_1, \sigma_2$ in the $(z_1, z_2)$ plane is mapped by $y = \Phi z$ to a rotated ellipse with axes $\phi_1, \phi_2$ in the $(y_1, y_2)$ plane.]
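A minimal sketch of this picture (the rotation angle and principal lengths are illustrative assumptions): sample z with diagonal covariance, rotate it by $\Phi$, and recover the principal lengths and directions from the eigendecomposition of the sample covariance of y.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = np.array([3.0, 0.5])                     # principal lengths sigma_1, sigma_2
theta = np.deg2rad(30)
Phi = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

Z = rng.normal(size=(10000, 2)) * sigma          # z with covariance diag(sigma^2)
Y = Z @ Phi.T                                    # y = Phi z, one sample per row

Sigma_y = np.cov(Y, rowvar=False)                # sample covariance of y
lam, vecs = np.linalg.eigh(Sigma_y)

print(np.sqrt(lam[::-1]))                        # ~ [3.0, 0.5]: principal lengths sigma_i
print(vecs[:, ::-1])                             # ~ columns of Phi (up to sign)
```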

Page 19: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

If y is Gaussian with covariance $\Sigma$, the equiprobability density contours are the ellipses whose

• “principal components” $\phi_i$ are the eigenvectors of $\Sigma$

• “principal values” $\sigma_i$ are the square roots of the eigenvalues $\lambda_i$ of $\Sigma$

By computing the eigenvalues we can find out whether the data is “flat”:

$\lambda_1 \gg \lambda_2$: flat; $\lambda_1 \approx \lambda_2$: not flat.

[Figure: an elongated ellipse ($\sigma_1 \gg \sigma_2$, flat) versus a nearly circular one ($\sigma_1 \approx \sigma_2$, not flat), with axes $\phi_1, \phi_2$.]

Page 20: Dimensionality Reduction & PCA

Learning the Principal Components
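A minimal sketch of learning the principal components from data, following the covariance-eigendecomposition view of the preceding slides (the function name and the synthetic data are illustrative, not from this slide):

```python
import numpy as np

def pca_eig(X, k):
    """Learn k principal components from data X (one example per column, d x n)."""
    mu = X.mean(axis=1, keepdims=True)        # sample mean
    Xc = X - mu                               # centered data
    Sigma = Xc @ Xc.T / X.shape[1]            # sample covariance (d x d)
    lam, Phi = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]             # re-sort in descending order
    return lam[order][:k], Phi[:, order][:, :k], mu

# Project data onto the top-k principal components
X = np.random.default_rng(2).normal(size=(5, 100))
lam, Phi_k, mu = pca_eig(X, k=2)
Y = Phi_k.T @ (X - mu)                        # k x n matrix of projected features
```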

Page 21: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

Page 22: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

How to determine the number k of eigenvectors to keep?

• One possibility is to plot eigenvalue magnitudes

- This yields a “scree plot” or “L-curve”

• Usually there is a fast decrease in the eigenvalue magnitude followed by a flat area

• One good choice is the “knee” of this curve
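A minimal scree-plot sketch (the eigenvalues below are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

lam = np.array([9.1, 4.2, 1.3, 0.4, 0.3, 0.25, 0.2])   # eigenvalues, sorted descending

plt.plot(np.arange(1, len(lam) + 1), lam, "o-")
plt.xlabel("eigenvalue index i")
plt.ylabel(r"eigenvalue magnitude $\lambda_i$")
plt.title("Scree plot: keep components up to the 'knee'")
plt.show()
```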

Page 23: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

Another possibility is the percentage of explained variance.

• Remember that the eigenvalues $\lambda_i = \sigma_i^2$ are a measure of variance

• The ratio

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$$

measures the % of total variance contained in the top k eigendirections

• This is a measure of the fraction of data variability along the associated eigenvectors

[Figure: the ellipses of the previous slides, with principal lengths $\sigma_1, \sigma_2$ along the eigendirections $\phi_1, \phi_2$.]

Page 24: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

A natural measure is to pick the eigenvectors that explain p % of the data variability

• This can be done by plotting the ratio rk as a function of k

• E.g. we need 3 eigenvectors to cover 70% of the variability of this dataset

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$$
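A small sketch that computes $r_k$ and picks the smallest k explaining a fraction p of the variance (the eigenvalues are illustrative):

```python
import numpy as np

def num_components(lam, p=0.7):
    """Smallest k whose top-k eigenvalues explain at least a fraction p of the variance."""
    lam = np.sort(lam)[::-1]                    # eigenvalues in descending order
    r = np.cumsum(lam) / np.sum(lam)            # r_k for k = 1, ..., n
    return int(np.argmax(r >= p)) + 1, r

lam = np.array([4.0, 2.0, 1.5, 1.3, 0.7, 0.5])
k, r = num_components(lam, p=0.7)
print(k, r)        # k = 3 here: r_3 = 7.5 / 10 = 0.75 >= 0.7
```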

Page 25: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

An alternative way to compute the principal components is to use the singular value decomposition (SVD)

SVD (full column-rank case):

• Any real n x m matrix of rank m can be decomposed as

$$A = M \,\Pi\, N^T$$

• M is an n x m column-orthonormal matrix of “left singular vectors” (the columns of M), with $M^T M = I$. M is necessarily “tall” or “square”.

• $\Pi$ is an m x m diagonal matrix of strictly positive “singular values” $\pi_i$.

• N is an m x m orthogonal matrix of “right singular vectors” (the columns of N), with $N^T N = N N^T = I$. Note that both the rows and columns of N are orthonormal.
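A quick numerical check of this decomposition with NumPy (the random matrix is illustrative; `full_matrices=False` gives the economy SVD matching the full column-rank case above):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 3))                       # n x m, rank m (almost surely)

M, pi, Nt = np.linalg.svd(A, full_matrices=False) # A = M diag(pi) N^T
N = Nt.T

print(np.allclose(A, M @ np.diag(pi) @ N.T))      # A = M Pi N^T
print(np.allclose(M.T @ M, np.eye(3)))            # columns of M are orthonormal
print(np.allclose(N.T @ N, np.eye(3)))            # N is orthogonal
print(pi)                                         # strictly positive singular values
```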

Page 26: Dimensionality Reduction & PCA

PCA by SVD

To relate this to PCA, we consider the “data matrix”

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$

where each data vector $x_i$ is d-dimensional. Note that X is of dimension d x n.

The sample mean is

$$\mu = \frac{1}{n}\sum_{i} x_i = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{1}{n} X \mathbf{1}$$

Page 27: Dimensionality Reduction & PCA

PCA by SVD

We “center” the data by subtracting the mean from each column of X

This yields the “centered data matrix”

$$X_c = \begin{bmatrix} | & & | \\ x_1 - \mu & \cdots & x_n - \mu \\ | & & | \end{bmatrix} = X - \mu \mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)$$

Page 28: Dimensionality Reduction & PCA

PCA by SVD

The sample covariance is

$$\Sigma = \frac{1}{n}\sum_{i} (x_i - \mu)(x_i - \mu)^T = \frac{1}{n}\sum_{i} x_i^c \,(x_i^c)^T$$

where $x_i^c$ is the i-th column of $X_c$.

This can be written as

$$\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix} = \frac{1}{n} X_c X_c^T$$

Page 29: Dimensionality Reduction & PCA

PCA by SVD

The matrix $X_c^T$ is real n x d. Assuming it has rank d, it has an SVD

$$X_c^T = M \,\Pi\, N^T, \qquad M^T M = I, \quad N^T N = N N^T = I$$

Therefore

$$\Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \,\Pi\, M^T M \,\Pi\, N^T = N \left(\frac{\Pi^2}{n}\right) N^T$$

Page 30: Dimensionality Reduction & PCA

PCA by SVD

$$\Sigma = N \left(\frac{\Pi^2}{n}\right) N^T$$

• The fact that N is d x d and orthonormal, and $\Pi^2$ is diagonal, shows that this is just the eigenvalue decomposition of $\Sigma$.

It follows that

• The eigenvectors of $\Sigma$ are the columns of N

• The eigenvalues of $\Sigma$ are

$$\lambda_i = \frac{\pi_i^2}{n}$$

This gives an alternative algorithm for PCA.

Page 31: Dimensionality Reduction & PCA

PCA by SVD

Computation of PCA by SVD

Given a data matrix X with one example per column

• 1) Create the centered data matrix

$$X_c^T = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T$$

• 2) Compute the SVD

$$X_c^T = M \,\Pi\, N^T$$

• 3) The principal components are the columns of N. The eigenvalues are

$$\lambda_i = \frac{\pi_i^2}{n}$$
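A minimal sketch of this algorithm in code, cross-checked against the eigendecomposition of the sample covariance (the function name and test data are illustrative):

```python
import numpy as np

def pca_by_svd(X):
    """PCA via SVD: X is d x n with one example per column.
    Returns eigenvalues (descending) and principal components (columns of N)."""
    d, n = X.shape
    # 1) centered data matrix X_c^T = (I - (1/n) 1 1^T) X^T
    XcT = (np.eye(n) - np.ones((n, n)) / n) @ X.T
    # 2) SVD  X_c^T = M Pi N^T
    M, pi, Nt = np.linalg.svd(XcT, full_matrices=False)
    N = Nt.T
    # 3) principal components = columns of N, eigenvalues lambda_i = pi_i^2 / n
    return pi**2 / n, N

# Sanity check against the eigendecomposition of the sample covariance
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 500))
lam_svd, N = pca_by_svd(X)

Xc = X - X.mean(axis=1, keepdims=True)
lam_cov = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / X.shape[1]))[::-1]
print(np.allclose(lam_svd, lam_cov))     # True
```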

Page 32: Dimensionality Reduction & PCA

PCA – Example

Principal components are usually quite informative about the structure of the data

Example

• the principal components for the space of images of faces

• the figure only shows the first 16 eigenvectors

• note the lighting, structure, etc.

Page 33: Dimensionality Reduction & PCA

PCA - Example

PCA has been applied to virtually all learning problems

E.g. eigenshapes for face morphing

[Figure: morphed faces.]

Page 34: Dimensionality Reduction & PCA

PCA - Example

Sound

[Figure: average of the sound images, and the eigenobjects corresponding to the three highest eigenvalues.]

Page 35: Dimensionality Reduction & PCA

PCA – Example

Turbulence in burning flames

[Figure: flames and the corresponding eigenflames.]

Page 36: Dimensionality Reduction & PCA

PCA - Example

Video

[Figure: eigenrings and reconstruction.]

Page 37: Dimensionality Reduction & PCA

PCA – Example

Text: latent semantic analysis (or indexing)

• represent each document by a word histogram

• perform an SVD of the term-document matrix

• interpret the principal components as the directions of semantic “concepts”

[Figure: the (terms x documents) matrix factors as (terms x concepts) x (concepts x concepts) x (concepts x documents).]
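A toy LSA sketch via truncated SVD (the term-document counts and term labels are illustrative, not from the slides):

```python
import numpy as np

# Toy term-document count matrix (terms x documents)
A = np.array([[2, 1, 0, 0],    # "car"
              [1, 2, 0, 0],    # "truck"
              [0, 0, 2, 1],    # "flower"
              [0, 0, 1, 2]],   # "petal"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                   # keep the top-k semantic "concepts"
doc_concepts = np.diag(s[:k]) @ Vt[:k]  # documents in concept space (k x documents)
term_concepts = U[:, :k]                # terms in concept space (terms x k)

print(doc_concepts.round(2))
```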

Page 38: Dimensionality Reduction & PCA

PCA in Latent Semantic Analysis (LSA)

Applications:

• document classification, information retrieval

Goal: solve two fundamental problems in language

• Synonymy: different writers use different words to describe the same idea.

• Polysemy: the same word can have multiple meanings.

Reasons:

• the original term-document matrix is too large for the computing resources

• the original term-document matrix is noisy: for instance, anecdotal instances of terms should be eliminated

• the original term-document matrix is overly sparse relative to the "true" term-document matrix, e.g. it lists only the words actually in each document, whereas we might be interested in all words related to each document (a much larger set, due to synonymy)

Page 39: Dimensionality Reduction & PCA

PCA and LSA

After PCA some dimensions get "merged":

• {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}

This mitigates synonymy,

• since it merges the dimensions associated with terms that have similar meanings,

and mitigates polysemy:

• components of polysemous words that point in the "right" direction are added to the components of words that share this sense.

• conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.

Page 40: Dimensionality Reduction & PCA

END