
Page 1: Dimensionality Reduction & PCA

Dimensionality Reduction & PCA

Ken Kreutz-Delgado

Nuno Vasconcelos

ECE 175B – Spring 2011 – UCSD

Page 2: Dimensionality Reduction & PCA

The Curse of Dimensionality

The curse of dimensionality means that

• for a given classifier, the number of examples required to maintain classification accuracy increases exponentially with the dimension of the feature space

In higher dimensions the classifier has more parameters

• and therefore a higher complexity, which makes it harder to learn
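To make the exponential growth concrete, here is a minimal numeric sketch (not from the slides), assuming a histogram-style density estimate that needs at least one example per cell of a grid with 10 bins per axis:

```python
# Number of cells needed to cover [0,1]^d with 10 bins per axis.
# A density estimate needing at least one example per cell requires ~10**d samples.
bins_per_axis = 10
for d in (1, 2, 3, 5, 10):
    print(f"d={d:2d}: {bins_per_axis**d:,} cells")
```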

Page 3: Dimensionality Reduction & PCA

Dimensionality Reduction

How can we avoid the curse of dimensionality?

Avoid unnecessary dimensions in feature space!

“Unnecessary” can be measured in two ways:

1. Features are non-discriminative

2. Features are not independent

“Non-discriminative” means that they do not separate classes well:

[Figure: two scatter plots, one with well-separated classes (discriminative features) and one with overlapping classes (non-discriminative features).]

Page 4: Dimensionality Reduction & PCA

Dimensionality Reduction

In addition, dependent features are not always all needed - fewer can be enough!

E.g. consider a data-mining company studying consumer credit card ratings, with a feature vector:

X = {salary, mortgage, car loan, # of kids, profession, ...}

Note that the first three features tend to be highly correlated:

• “The more you make, the higher the mortgage, the more expensive the car you drive”

• From one of these variables one can often predict the others very well

Including features 2 and 3 might not greatly increase the discrimination, but it will definitely increase the dimension, which can lead to poor density estimates.

Page 5: Dimensionality Reduction & PCA

Dimensionality Reduction

Q: How do we detect the presence of feature correlations?

A: Detect whether the data “mostly lives” in a lower-dimensional subspace (up to some amount of noise).

In the example below we have a 1-D hyperplane (a line) in 2-D.

• If we can find this hyperplane, we can project the data onto it and thereby remove one dimension of the feature space without introducing significant error.

[Figure: scatter of (salary, car loan) data concentrated around a line; projecting onto the 1-D subspace y = a x yields a new 1-D feature y.]
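A minimal sketch of this projection in code, using synthetic correlated data (the salary/car-loan numbers and the use of the leading covariance eigenvector are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
salary = rng.normal(50, 10, size=200)                  # synthetic "salary" feature
car_loan = 0.5 * salary + rng.normal(0, 2, size=200)   # strongly correlated with salary

X = np.column_stack([salary, car_loan])                # n x 2 data matrix
Xc = X - X.mean(axis=0)                                # center the data

# Direction of maximum variance = leading eigenvector of the sample covariance
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
a = eigvecs[:, -1]                                     # leading principal direction

y = Xc @ a                                             # new 1-D feature (projection onto the subspace)
print("variance captured:", eigvals[-1] / eigvals.sum())
```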

Page 6: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

Basic idea:

• If the data “lives” in a lower dimensional subspace, it is going to look flat when viewed from the full space, e.g.

This means that

• If we fit a Gaussian to the data the iso-probability contours (aka concentration ellipsoids) are going to be “pancake-like” ellipsoids. This is because the location of the “probability mass” tells you where the data “lives.”

[Figure: left, a 1-D subspace in 2-D; right, a 2-D subspace in 3-D.]

Page 7: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

How do we find these ellipsoids?

Recall that when we talked about metrics we said that

• the Mahalanobis distance

$$d(x, y) = (x - y)^T \Sigma^{-1} (x - y)$$

determines the “natural” units for a Gaussian pdf, because it is “adapted” to the covariance of the data.

We also know that

• what is special about it is that it uses $\Sigma^{-1}$

Hence, the subspace information must be in the covariance $\Sigma$.
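A small sketch of the Mahalanobis distance as defined above (the example covariance is illustrative):

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """Mahalanobis distance d(x, y) = (x - y)^T Sigma^{-1} (x - y)."""
    diff = x - y
    return diff @ np.linalg.solve(Sigma, diff)   # avoids forming Sigma^{-1} explicitly

Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
x = np.array([1.0, 0.0])
y = np.array([0.0, 0.0])
print(mahalanobis(x, y, Sigma))
```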

Page 8: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

All relevant information is stored in the eigenvalue decomposition of the covariance matrix.

So, let's start with a brief review of eigenvectors.

• Recall: an n x n matrix represents an operator that maps a vector from $R^n$ into a vector in $R^n$.

E.g. the equation y = Ax represents the sending of x in $R^n$ to y in $R^n$:

$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$

[Figure: A maps the vector x, expressed in the basis $e_1, \ldots, e_n$, to the vector y in the same space.]

Page 9: Dimensionality Reduction & PCA

Eigenvectors

What is amazing is that there are “special” (“eigen” or “proper”) vectors which are simply re-scaled by the transformation.

These are the eigenvectors of the matrix A.

• They are the solutions $\phi_i$ of the equation

$$A\phi_i = \lambda_i \phi_i$$

where the $\lambda_i$ are the matrix eigenvalues.

[Figure: an eigenvector of A is mapped to a re-scaled copy of itself.]
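A quick check of this definition with NumPy (the matrix A is illustrative):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs are the eigenvectors phi_i

for lam, phi in zip(eigvals, eigvecs.T):
    # A phi = lambda phi: the transformation only re-scales these directions
    print(np.allclose(A @ phi, lam * phi))   # True
```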

Page 10: Dimensionality Reduction & PCA

Eigenvector Decomposition

Note that if there are n eigenvectors, these equations can be written all at once as

$$A \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ \lambda_1 \phi_1 & \cdots & \lambda_n \phi_n \\ | & & | \end{bmatrix}$$

or

$$A \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} = \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$

i.e., with $\Phi = [\phi_1 \cdots \phi_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$,

$$A\Phi = \Phi\Lambda$$

Page 11: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

A covariance matrix is symmetric, so there are n orthogonal eigenvectors (which can be taken to have unit norm).

This implies that $\Phi = [\phi_1 \cdots \phi_n]$ is an orthogonal matrix,

$$\Phi^T \Phi = \Phi \Phi^T = I$$

so that

$$A = \Phi \Lambda \Phi^T$$

This is called the eigenvector decomposition (or eigen-decomposition) of the symmetric matrix A.

It gives an alternative geometric interpretation to the matrix operation y = Ax via

$$y = Ax = \Phi \Lambda \Phi^T x$$
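A minimal sketch verifying these identities numerically (the symmetric matrix is illustrative):

```python
import numpy as np

# A symmetric (e.g. covariance) matrix
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

lam, Phi = np.linalg.eigh(A)      # eigenvalues (ascending) and orthonormal eigenvectors
Lambda = np.diag(lam)

print(np.allclose(Phi.T @ Phi, np.eye(2)))        # Phi^T Phi = I
print(np.allclose(Phi @ Lambda @ Phi.T, A))       # A = Phi Lambda Phi^T

x = np.array([1.0, -2.0])
print(np.allclose(A @ x, Phi @ (Lambda @ (Phi.T @ x))))   # y = Ax computed in three steps
```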

Page 12: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

This can be interpreted as a sequence of three steps

• 1) Apply $\Phi^T$ to x:

$$x' = \Phi^T x$$

We know that this is (nearly) a rotation. It is just the inverse rotation.

• 2) Apply $\Lambda$:

$$x'' = \Lambda x' = \begin{bmatrix} \lambda_1 x'_1 \\ \vdots \\ \lambda_n x'_n \end{bmatrix}$$

This is just coordinate-wise scaling.

• 3) Apply $\Phi$:

$$y = \Phi x''$$

This is (nearly) a rotation back to the initial space.

Page 13: Dimensionality Reduction & PCA

Rotation Matrices

Remember that rotation matrices are best understood by considering how the matrix operates on the vectors of the canonical basis.

• Note that $\Phi$ sends $e_1$ to $\phi_1$:

$$\Phi e_1 = \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_n \\ | & & | \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \phi_1$$

• Since $\Phi^T$ is the inverse rotation, it sends $\phi_1$ to $e_1$.

Hence, the sequence of operations is

• 1) Send $\phi_i$ to $e_i$ (the canonical basis in the $\phi_i$-frame)

• 2) Rescale the $e_i$

• 3) Rotate the re-scaled $e_i$ back to the original coordinate frame

[Figure: a 2-D rotation taking $e_1$ to the vector $(\cos\theta, \sin\theta)$.]

Page 14: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

Graphically, this means that:

1) the $\phi_i$ are the axes of the ellipse

2) the width of the ellipse depends on the amount of “stretching”

[Figure: the unit circle is rotated by $\Phi^T$ (step 1), stretched coordinate-wise (step 2), and rotated back by $\Phi$ (step 3), producing an ellipse.]

Page 15: Dimensionality Reduction & PCA

Symmetric Eigendecomposition

Note that the stretching is done in step (2): for $x' = e_i$, the length of $x'' = \Lambda x'$ is $\lambda_i$.

Hence, the overall picture is:

• The axes are given by the $\phi_i$

• These have length $\lambda_i$

This insight can be used to find low-dimensional data subspaces.

[Figure: step (2) stretches $e_1, e_2$ by $\lambda_1, \lambda_2$; after the final rotation, the ellipse has axes $\phi_i$ of length $\lambda_i$.]

Page 16: Dimensionality Reduction & PCA

Multivariate Gaussian Review

The equiprobability density contours of a d-dimensional Gaussian are the points x such that

$$(x - \mu)^T \Sigma^{-1} (x - \mu) = c$$

for some constant c. Let's consider the change of variable $z = x - \mu$, which only shifts the origin to the point $\mu$. The resulting equation

$$z^T \Sigma^{-1} z = c$$

is the equation of an ellipse (hyper-ellipse). This is easy to see when $\Sigma$ is diagonal, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$:

$$\sum_{i=1}^{d} \frac{z_i^2}{\sigma_i^2} = c$$

Page 17: Dimensionality Reduction & PCA

Gaussian Review

This is the equation of an ellipse with principal lengths $\sigma_i$.

• E.g. when d = 2, the equation

$$\frac{z_1^2}{\sigma_1^2} + \frac{z_2^2}{\sigma_2^2} = 1$$

describes the ellipse below.

Guided by this insight, introduce the transformation $y = \Phi z$.

[Figure: an axis-aligned ellipse in the $(z_1, z_2)$ plane with semi-axes $\sigma_1$ and $\sigma_2$.]

Page 18: Dimensionality Reduction & PCA

Gaussian Review

Introduce the transformation $y = \Phi z$, assuming that z has diagonal covariance $\Sigma_z = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$.

Then y has covariance

$$\Sigma_y = \Phi \Sigma_z \Phi^T$$

Since $\Phi$ is orthogonal, it is (nearly) a rotation, so we obtain a rotated ellipse with principal components $\phi_1$ and $\phi_2$, which are the columns of $\Phi$.

Note that $\Sigma_y = \Phi \Sigma_z \Phi^T$ is the eigendecomposition of $\Sigma_y$.

[Figure: the axis-aligned ellipse with semi-axes $\sigma_1, \sigma_2$ in the $(z_1, z_2)$ plane is mapped by $y = \Phi z$ to a rotated ellipse with axes $\phi_1, \phi_2$ in the $(y_1, y_2)$ plane.]
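A minimal sketch of this picture (the rotation angle and principal lengths are illustrative assumptions): sample z with diagonal covariance, rotate it by $\Phi$, and recover the principal lengths and directions from the eigendecomposition of the sample covariance of y.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = np.array([3.0, 0.5])                     # principal lengths sigma_1, sigma_2
theta = np.deg2rad(30)
Phi = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

Z = rng.normal(size=(10000, 2)) * sigma          # z with covariance diag(sigma^2)
Y = Z @ Phi.T                                    # y = Phi z, one sample per row

Sigma_y = np.cov(Y, rowvar=False)                # sample covariance of y
lam, vecs = np.linalg.eigh(Sigma_y)

print(np.sqrt(lam[::-1]))                        # ~ [3.0, 0.5]: principal lengths sigma_i
print(vecs[:, ::-1])                             # ~ columns of Phi (up to sign)
```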

Page 19: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

If y is Gaussian with covariance $\Sigma$, the equiprobability density contours are the ellipses whose

• “principal components” $\phi_i$ are the eigenvectors of $\Sigma$

• “principal values” $\sigma_i$ are the square roots of the eigenvalues $\lambda_i$ of $\Sigma$

By computing the eigenvalues we can find out whether the data is “flat”:

$\lambda_1 \gg \lambda_2$: flat; $\lambda_1 \approx \lambda_2$: not flat.

[Figure: an elongated ellipse ($\sigma_1 \gg \sigma_2$, flat) versus a nearly circular one ($\sigma_1 \approx \sigma_2$, not flat), with axes $\phi_1, \phi_2$.]

Page 20: Dimensionality Reduction & PCA

Learning the Principal Components
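A minimal sketch of learning the principal components from data, following the covariance-eigendecomposition view of the preceding slides (the function name and the synthetic data are illustrative, not from this slide):

```python
import numpy as np

def pca_eig(X, k):
    """Learn k principal components from data X (one example per column, d x n)."""
    mu = X.mean(axis=1, keepdims=True)        # sample mean
    Xc = X - mu                               # centered data
    Sigma = Xc @ Xc.T / X.shape[1]            # sample covariance (d x d)
    lam, Phi = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]             # re-sort in descending order
    return lam[order][:k], Phi[:, order][:, :k], mu

# Project data onto the top-k principal components
X = np.random.default_rng(2).normal(size=(5, 100))
lam, Phi_k, mu = pca_eig(X, k=2)
Y = Phi_k.T @ (X - mu)                        # k x n matrix of projected features
```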

Page 21: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

Page 22: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

How to determine the number k of eigenvectors to keep?

• One possibility is to plot eigenvalue magnitudes

- This yields a “scree plot” or “L-curve”

• Usually there is a fast decrease in the eigenvalue magnitude followed by a flat area

• One good choice is the “knee” of this curve
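A minimal scree-plot sketch (the eigenvalues below are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

lam = np.array([9.1, 4.2, 1.3, 0.4, 0.3, 0.25, 0.2])   # eigenvalues, sorted descending

plt.plot(np.arange(1, len(lam) + 1), lam, "o-")
plt.xlabel("eigenvalue index i")
plt.ylabel(r"eigenvalue magnitude $\lambda_i$")
plt.title("Scree plot: keep components up to the 'knee'")
plt.show()
```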

Page 23: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

Another possibility is the percentage of explained variance.

• Remember that the eigenvalues $\lambda_i = \sigma_i^2$ are a measure of variance

• The ratio

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$$

measures the % of total variance contained in the top k eigendirections

• This is a measure of the fraction of data variability along the associated eigenvectors

[Figure: the ellipses of the previous slides, with principal lengths $\sigma_1, \sigma_2$ along the eigendirections $\phi_1, \phi_2$.]

Page 24: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

A natural measure is to pick the eigenvectors that explain p % of the data variability

• This can be done by plotting the ratio rk as a function of k

• E.g. we need 3 eigenvectors to cover 70% of the variability of this dataset

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$$
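A small sketch that computes $r_k$ and picks the smallest k explaining a fraction p of the variance (the eigenvalues are illustrative):

```python
import numpy as np

def num_components(lam, p=0.7):
    """Smallest k whose top-k eigenvalues explain at least a fraction p of the variance."""
    lam = np.sort(lam)[::-1]                    # eigenvalues in descending order
    r = np.cumsum(lam) / np.sum(lam)            # r_k for k = 1, ..., n
    return int(np.argmax(r >= p)) + 1, r

lam = np.array([4.0, 2.0, 1.5, 1.3, 0.7, 0.5])
k, r = num_components(lam, p=0.7)
print(k, r)        # k = 3 here: r_3 = 7.5 / 10 = 0.75 >= 0.7
```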

Page 25: Dimensionality Reduction & PCA

Principal Component Analysis (PCA)

An alternative way to compute the principal components is to use the singular value decomposition (SVD)

SVD (full column-rank case):

• Any real n x m matrix of rank m can be decomposed as

$$A = M \,\Pi\, N^T$$

• M is an n x m column-orthonormal matrix of “left singular vectors” (the columns of M), with $M^T M = I$. M is necessarily “tall” or “square”.

• $\Pi$ is an m x m diagonal matrix of strictly positive “singular values” $\pi_i$.

• N is an m x m orthogonal matrix of “right singular vectors” (the columns of N), with $N^T N = N N^T = I$. Note that both the rows and columns of N are orthonormal.
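A quick numerical check of this decomposition with NumPy (the random matrix is illustrative; `full_matrices=False` gives the economy SVD matching the full column-rank case above):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 3))                       # n x m, rank m (almost surely)

M, pi, Nt = np.linalg.svd(A, full_matrices=False) # A = M diag(pi) N^T
N = Nt.T

print(np.allclose(A, M @ np.diag(pi) @ N.T))      # A = M Pi N^T
print(np.allclose(M.T @ M, np.eye(3)))            # columns of M are orthonormal
print(np.allclose(N.T @ N, np.eye(3)))            # N is orthogonal
print(pi)                                         # strictly positive singular values
```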

Page 26: Dimensionality Reduction & PCA

PCA by SVD

To relate this to PCA, we consider the “data matrix”

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$

where each data vector $x_i$ is d-dimensional. Note that X is of dimension d x n.

The sample mean is

$$\mu = \frac{1}{n}\sum_{i} x_i = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{1}{n} X \mathbf{1}$$

Page 27: Dimensionality Reduction & PCA

PCA by SVD

We “center” the data by subtracting the mean from each column of X

This yields the “centered data matrix”

$$X_c = \begin{bmatrix} | & & | \\ x_1 - \mu & \cdots & x_n - \mu \\ | & & | \end{bmatrix} = X - \mu \mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)$$

Page 28: Dimensionality Reduction & PCA

PCA by SVD

The sample covariance is

$$\Sigma = \frac{1}{n}\sum_{i} (x_i - \mu)(x_i - \mu)^T = \frac{1}{n}\sum_{i} x_i^c \,(x_i^c)^T$$

where $x_i^c$ is the i-th column of $X_c$.

This can be written as

$$\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix} = \frac{1}{n} X_c X_c^T$$

Page 29: Dimensionality Reduction & PCA

PCA by SVD

The matrix $X_c^T$ is real n x d. Assuming it has rank d, it has an SVD

$$X_c^T = M \,\Pi\, N^T, \qquad M^T M = I, \quad N^T N = N N^T = I$$

Therefore

$$\Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \,\Pi\, M^T M \,\Pi\, N^T = N \left(\frac{\Pi^2}{n}\right) N^T$$

Page 30: Dimensionality Reduction & PCA

PCA by SVD

$$\Sigma = N \left(\frac{\Pi^2}{n}\right) N^T$$

• The fact that N is d x d and orthonormal, and $\Pi^2$ is diagonal, shows that this is just the eigenvalue decomposition of $\Sigma$.

It follows that

• The eigenvectors of $\Sigma$ are the columns of N

• The eigenvalues of $\Sigma$ are

$$\lambda_i = \frac{\pi_i^2}{n}$$

This gives an alternative algorithm for PCA.

Page 31: Dimensionality Reduction & PCA

PCA by SVD

Computation of PCA by SVD

Given a data matrix X with one example per column

• 1) Create the centered data matrix

$$X_c^T = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T$$

• 2) Compute the SVD

$$X_c^T = M \,\Pi\, N^T$$

• 3) The principal components are the columns of N. The eigenvalues are

$$\lambda_i = \frac{\pi_i^2}{n}$$
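A minimal sketch of this algorithm in code, cross-checked against the eigendecomposition of the sample covariance (the function name and test data are illustrative):

```python
import numpy as np

def pca_by_svd(X):
    """PCA via SVD: X is d x n with one example per column.
    Returns eigenvalues (descending) and principal components (columns of N)."""
    d, n = X.shape
    # 1) centered data matrix X_c^T = (I - (1/n) 1 1^T) X^T
    XcT = (np.eye(n) - np.ones((n, n)) / n) @ X.T
    # 2) SVD  X_c^T = M Pi N^T
    M, pi, Nt = np.linalg.svd(XcT, full_matrices=False)
    N = Nt.T
    # 3) principal components = columns of N, eigenvalues lambda_i = pi_i^2 / n
    return pi**2 / n, N

# Sanity check against the eigendecomposition of the sample covariance
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 500))
lam_svd, N = pca_by_svd(X)

Xc = X - X.mean(axis=1, keepdims=True)
lam_cov = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / X.shape[1]))[::-1]
print(np.allclose(lam_svd, lam_cov))     # True
```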

Page 32: Dimensionality Reduction & PCA

PCA – Example

Principal components are usually quite informative about the structure of the data

Example

• the principal components for the space of images of faces

• the figure only shows the first 16 eigenvectors

• note the lighting, structure, etc.

Page 33: Dimensionality Reduction & PCA

PCA - Example

PCA has been applied to virtually all learning problems

E.g. eigenshapes for face morphing

[Figure: morphed faces.]

Page 34: Dimensionality Reduction & PCA

PCA - Example

Sound

[Figure: average of the sound images, and the eigenobjects corresponding to the three highest eigenvalues.]

Page 35: Dimensionality Reduction & PCA

PCA – Example

Turbulence in burning flames

[Figure: flames and the corresponding eigenflames.]

Page 36: Dimensionality Reduction & PCA

PCA - Example

Video

[Figure: eigenrings and reconstruction.]

Page 37: Dimensionality Reduction & PCA

PCA – Example

Text: latent semantic analysis (or indexing)

• represent each document by a word histogram

• perform an SVD of the term-document matrix

• interpret the principal components as the directions of semantic “concepts”

[Figure: the (terms x documents) matrix factors as (terms x concepts) x (concepts x concepts) x (concepts x documents).]
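A toy LSA sketch via truncated SVD (the term-document counts and term labels are illustrative, not from the slides):

```python
import numpy as np

# Toy term-document count matrix (terms x documents)
A = np.array([[2, 1, 0, 0],    # "car"
              [1, 2, 0, 0],    # "truck"
              [0, 0, 2, 1],    # "flower"
              [0, 0, 1, 2]],   # "petal"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                   # keep the top-k semantic "concepts"
doc_concepts = np.diag(s[:k]) @ Vt[:k]  # documents in concept space (k x documents)
term_concepts = U[:, :k]                # terms in concept space (terms x k)

print(doc_concepts.round(2))
```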

Page 38: Dimensionality Reduction & PCA

PCA in Latent Semantic Analysis (LSA)

Applications:

• document classification, information retrieval

Goal: solve two fundamental problems in language

• Synonymy: different writers use different words to describe the same idea.

• Polysemy: the same word can have multiple meanings.

Reasons:

• the original term-document matrix is too large for the computing resources

• the original term-document matrix is noisy: for instance, anecdotal instances of terms should be eliminated

• the original term-document matrix is overly sparse relative to the "true" term-document matrix, e.g. it lists only the words actually in each document, whereas we might be interested in all words related to each document (a much larger set, due to synonymy)

Page 39: Dimensionality Reduction & PCA

PCA and LSA

After PCA some dimensions get "merged":

• {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}

This mitigates synonymy,

• since it merges the dimensions associated with terms that have similar meanings,

and mitigates polysemy:

• components of polysemous words that point in the "right" direction are added to the components of words that share this sense.

• conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.

Page 40: Dimensionality Reduction & PCA

END