
Page 1:

Principal Component Analysis

Barnabás Póczos
University of Alberta
Nov 24, 2009

B: Chapter 12; HRF: Chapter 14.5

Page 2:

Contents

• Motivation
• PCA algorithms
• Applications
  • Face recognition
  • Facial expression recognition
• PCA theory
• Kernel PCA

Some of these slides are taken from
• Karl Booksh's research group
• Tom Mitchell
• Ron Parr

Page 3:

PCA Applications

• Data Visualization
• Data Compression
• Noise Reduction
• Data Classification
• Trend Analysis
• Factor Analysis

Page 4:

Data Visualization

Example:
• Given 53 blood and urine measurements (features) from 65 people.
• How can we visualize the measurements?

Page 5:

Data Visualization
• Matrix format (65 × 53): instances (people) in rows, features in columns. A few rows of the haematology block:

        H-WBC   H-RBC   H-Hgb   H-Hct   H-MCV    H-MCH   H-MCHC
  A1    8.0000  4.8200  14.1000 41.0000  85.0000 29.0000 34.0000
  A2    7.3000  5.0200  14.7000 43.0000  86.0000 29.0000 34.0000
  A3    4.3000  4.4800  14.1000 41.0000  91.0000 32.0000 35.0000
  A4    7.5000  4.4700  14.9000 45.0000 101.0000 33.0000 33.0000
  A5    7.3000  5.5200  15.4000 46.0000  84.0000 28.0000 33.0000
  A6    6.9000  4.8600  16.0000 47.0000  97.0000 33.0000 34.0000
  A7    7.8000  4.6800  14.7000 43.0000  92.0000 31.0000 34.0000
  A8    8.6000  4.8200  15.8000 42.0000  88.0000 33.0000 37.0000
  A9    5.1000  4.7100  14.0000 43.0000  92.0000 30.0000 32.0000

Difficult to see the correlations between the features...

Page 6:

Data Visualization
• Spectral format (65 plots, one for each person)

[Figure: measurement index (1-53) on the x-axis, measured value on the y-axis, one curve per person.]

Difficult to compare the different patients...

Page 7:

Data Visualization
• Spectral format (53 plots, one for each feature)

[Figure: person index on the x-axis, H-Bands value on the y-axis.]

Difficult to see the correlations between the features...

Page 8:

Data Visualization

[Figures: a bi-variate scatter plot (C-LDH vs. C-Triglycerides) and a tri-variate scatter plot (M-EPI vs. C-Triglycerides and C-LDH).]

How can we visualize the other variables???

... difficult to see in 4- or higher-dimensional spaces...

Page 9:

Data Visualization

• Is there a representation better than the coordinate axes?
• Is it really necessary to show all 53 dimensions?
  • ... what if there are strong correlations between the features?
• How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?

• A solution: Principal Component Analysis

Page 10:

Principal Component Analysis

PCA: orthogonal projection of the data onto a lower-dimensional linear space that...
• maximizes the variance of the projected data (purple line), and
• minimizes the mean squared distance between the data points and their projections (sum of the blue lines).

Page 11:

Principal Components Analysis

Idea:
• Given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
  • E.g., find the best planar approximation to 3-D data.
  • E.g., find the best 12-D approximation to 10^4-D data.
• In particular, choose the projection that minimizes the squared error in reconstructing the original data.

Page 12:

The Principal Components

• Vectors originating from the center of mass.
• Principal component #1 points in the direction of the largest variance.
• Each subsequent principal component...
  • is orthogonal to the previous ones, and
  • points in the direction of the largest variance of the residual subspace.

Page 13:

2D Gaussian dataset

Page 14:

1st PCA axis

Page 15:

2nd PCA axis

Page 16:

PCA algorithm I (sequential)

Given the centered data {x_1, ..., x_m}, compute the principal vectors:

1st PCA vector (we maximize the variance of the projection of x):

  w_1 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} (w^\top x_i)^2

k-th PCA vector (we maximize the variance of the projection in the residual subspace):

  w_k = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \Big[ w^\top \Big( x_i - \sum_{j=1}^{k-1} w_j w_j^\top x_i \Big) \Big]^2

PCA reconstruction (shown for two components): x' = w_1 (w_1^\top x) + w_2 (w_2^\top x).
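A minimal numpy sketch of this sequential (deflation) procedure, assuming the centered data sit in an array X of shape (m, d); the function and variable names are illustrative, not from the slides:

import numpy as np

def sequential_pca(X, k, n_iter=200):
    """Top-k principal vectors of centered data X (m x d), found one at a
    time by maximizing the variance in the current residual subspace."""
    m, d = X.shape
    R = X.copy()                      # residual data
    W = np.zeros((d, k))
    for j in range(k):
        w = np.random.randn(d)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):       # power iteration on R^T R
            w = R.T @ (R @ w)
            w /= np.linalg.norm(w)
        W[:, j] = w
        R = R - np.outer(R @ w, w)    # remove the direction just found
    return W

Calling sequential_pca(X - X.mean(axis=0), k=2) returns the first two principal vectors as the columns of W.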

Page 17:

PCA algorithm II (sample covariance matrix)

• Given data {x_1, ..., x_m}, compute the covariance matrix Σ:

  \Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^\top, \qquad \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i

• PCA basis vectors = the eigenvectors of Σ
• Larger eigenvalue ⇒ more important eigenvector

Page 18:

PCA algorithm II

PCA algorithm(X, k): top k eigenvalues/eigenvectors

  % X = N × m data matrix,
  % ... each data point x_i = column vector, i = 1..m
  • \bar{x} ← (1/m) Σ_{i=1..m} x_i
  • X ← subtract the mean \bar{x} from each column vector x_i in X
  • Σ ← X X^T                      ... covariance matrix of X
  • {λ_i, u_i}_{i=1..N} ← eigenvalues/eigenvectors of Σ, sorted so that λ_1 ≥ λ_2 ≥ ... ≥ λ_N
  • Return {λ_i, u_i}_{i=1..k}     % the top k principal components
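A runnable numpy version of this pseudo-code, keeping the slide's convention that the data points are the columns of an N × m matrix (a sketch, not the original implementation):

import numpy as np

def pca(X, k):
    """X: N x m data matrix, each column is one data point.
    Returns the top-k eigenvalues and eigenvectors of the
    sample covariance matrix."""
    x_mean = X.mean(axis=1, keepdims=True)
    Xc = X - x_mean                        # subtract mean from each column
    Sigma = (Xc @ Xc.T) / X.shape[1]       # N x N sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]      # sort: lambda_1 >= ... >= lambda_N
    return eigvals[order[:k]], eigvecs[:, order[:k]]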

Page 19:

PCA algorithm III (SVD of the data matrix)

Singular Value Decomposition of the centered data matrix X (features × samples):

  X = U S V^T

[Figure: block diagram of X = U S V^T; the leading columns of U, the leading singular values in S, and the leading rows of V^T carry the significant structure, while the remaining blocks mostly capture noise.]

Page 20:

PCA algorithm III

• Columns of U
  • are the principal vectors { u^(1), ..., u^(k) }
  • are orthogonal and have unit norm, so U^T U = I
  • the data can be reconstructed using linear combinations of { u^(1), ..., u^(k) }
• Matrix S
  • is diagonal
  • shows the importance of each eigenvector
• Columns of V^T
  • hold the coefficients for reconstructing the samples
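The same decomposition can be sketched with numpy's SVD, again with samples as columns; this is one possible rendering of the idea, not code from the slides:

import numpy as np

def pca_svd(X, k):
    """X: N x m data matrix (features x samples).
    U's columns are the principal vectors, S the singular values,
    and Vt holds the coefficients for reconstructing the samples."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

# The eigenvalues of the sample covariance matrix are S**2 / m,
# so S also sorts the components by importance.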

Page 21:

Face recognition

Page 22:

Challenge: Facial Recognition

• Want to identify a specific person, based on a facial image
• Robust to glasses, lighting, ...

Can't just use the given 256 × 256 pixels.

Page 23:

Applying PCA: Eigenfaces

• Example data set: images of faces
  • Famous Eigenface approach [Turk & Pentland], [Sirovich & Kirby]
• Each face x is ...
  • 256 × 256 values (luminance at each location)
  • x ∈ R^(256·256), viewed as a 64K-dimensional vector
• Form X = [x_1, ..., x_m], the centered data matrix (256·256 rows of real values, one column per face)
• Compute Σ = X X^T
  • Problem: Σ is 64K × 64K ... HUGE!!!

Method A: Build a PCA subspace for each person and check which subspace reconstructs the test image best.

Method B: Build one PCA database for the whole dataset and then classify based on the weights.
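A hedged sketch of Method A (the helper names and data layout are assumptions, not from the slides): build one subspace per person, then assign a test image to the person whose subspace reconstructs it with the smallest error.

import numpy as np

def fit_subspace(faces, k):
    """faces: d x m matrix of one person's face vectors (columns)."""
    mean = faces.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, U[:, :k]

def classify(test, subspaces):
    """subspaces: dict person -> (mean, U). Pick the subspace
    that reconstructs the test image best."""
    errors = {}
    for person, (mean, U) in subspaces.items():
        centered = test - mean.ravel()
        recon = U @ (U.T @ centered)         # project and reconstruct
        errors[person] = np.linalg.norm(centered - recon)
    return min(errors, key=errors.get)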

Page 24:

Computational Complexity

• Suppose m instances, each of size N
  • Eigenfaces: m = 500 faces, each of size N = 64K
• Given an N × N covariance matrix, we can compute
  • all N eigenvectors/eigenvalues in O(N^3)
  • the first k eigenvectors/eigenvalues in O(k N^2)
• But if N = 64K, this is EXPENSIVE!

Page 25:

A Clever Workaround

• Note that m << 64K
• Use L = X^T X (an m × m matrix) instead of Σ = X X^T
• If v is an eigenvector of L, then Xv is an eigenvector of Σ

Proof:
  L v = λ v
  X^T X v = λ v
  X (X^T X v) = X (λ v) = λ (X v)
  (X X^T)(X v) = λ (X v)
  Σ (X v) = λ (X v)
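A small numpy sketch of this workaround, under the same N × m layout as before (an illustration, not the slides' code):

import numpy as np

def eigenfaces_small_trick(X, k):
    """X: N x m centered data matrix with m << N.
    Returns k principal vectors of Sigma = X X^T without forming it."""
    L = X.T @ X                              # m x m instead of N x N
    eigvals, V = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1][:k]
    U = X @ V[:, order]                      # X v is an eigenvector of X X^T
    U /= np.linalg.norm(U, axis=0)           # renormalize to unit length
    return eigvals[order], U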

Page 26:

Principal Components (Method B)

Page 27:

Reconstructing... (Method B)

• ... faster if we train with...
  • only people without glasses
  • the same lighting conditions

Page 28:

Shortcomings

• Requires carefully controlled data:
  • All faces centered in the frame
  • Same size
  • Some sensitivity to angle
• Alternative:
  • "Learn" one set of PCA vectors for each angle
  • Use the one with the lowest error
• Method is completely knowledge-free
  • (sometimes this is good!)
  • Doesn't know that faces are wrapped around 3D objects (heads)
  • Makes no effort to preserve class distinctions

Page 29:

Facial expression recognition

Page 30:

Happiness subspace (method A)

Page 31:

Disgust subspace (method A)

Page 32:

Facial Expression Recognition

Movies (method A)

Page 33:

Facial Expression Recognition

Movies (method A)

Page 34:

Facial Expression Recognition

Movies (method A)

Page 35:

Image Compression

Page 36:

Original Image

• Divide the original 372 × 492 image into patches:
  • each patch is an instance that contains 12 × 12 pixels on a grid
• View each patch as a 144-D vector
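A minimal sketch of the patch pipeline just described, in plain numpy (the function name and the choice of k are illustrative assumptions, and a grayscale image whose sides are multiples of the patch size is assumed): cut the image into non-overlapping 12 × 12 patches, run PCA on the 144-D patch vectors, and reconstruct each patch from its first k components.

import numpy as np

def compress_patches(img, patch=12, k=16):
    """img: 2-D grayscale array whose sides are multiples of `patch`.
    Each patch becomes a 144-D vector; keep only k principal components."""
    h, w = img.shape
    # split into non-overlapping patch x patch blocks, one row per patch
    blocks = (img.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))
    mean = blocks.mean(axis=0)
    U, S, Vt = np.linalg.svd(blocks - mean, full_matrices=False)
    coords = (blocks - mean) @ Vt[:k].T          # k numbers per patch
    recon = coords @ Vt[:k] + mean               # reconstruct the patches
    return (recon.reshape(h // patch, w // patch, patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(h, w))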

Page 37:

L2 reconstruction error vs. number of PCA dimensions

Page 38:

PCA compression: 144D → 60D

Page 39:

PCA compression: 144D → 16D

Page 40:

16 most important eigenvectors

[Figure: the 16 leading eigenvectors displayed as 12 × 12 patches.]

Page 41:

PCA compression: 144D → 6D

Page 42:

6 most important eigenvectors

[Figure: the 6 leading eigenvectors displayed as 12 × 12 patches.]

Page 43:

PCA compression: 144D → 3D

Page 44:

3 most important eigenvectors

[Figure: the 3 leading eigenvectors displayed as 12 × 12 patches.]

Page 45:

PCA compression: 144D → 1D

Page 46:

60 most important eigenvectors

They look like the discrete cosine bases of JPEG!...

Page 47:

2D Discrete Cosine Basis

http://en.wikipedia.org/wiki/Discrete_cosine_transform

Page 48:

Noise Filtering

Page 49:

Noise Filtering, Auto-Encoder...

[Diagram: the input x is encoded as U^T x and decoded back to the reconstruction x' = U U^T x.]
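In code, this linear "auto-encoder" view is just encode-then-decode with the leading components; a one-function sketch, assuming U and the mean come from any of the PCA routines sketched earlier:

def pca_denoise(x, U, mean):
    """Project a (noisy) vector x onto the span of the columns of U
    and reconstruct: x' = mean + U U^T (x - mean)."""
    return mean + U @ (U.T @ (x - mean))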

Page 50:

Noisy image

Page 51:

Denoised image using 15 PCA components

Page 52:

PCA Shortcomings

Page 53:

PCA, a Problematic Data Set

PCA doesn’t know labels!

Page 54:

PCA vs. Fisher Linear Discriminant

• PCA maximizes variance, independent of class (magenta line)
• FLD attempts to separate the classes (green line)

Page 55:

PCA, a Problematic Data Set

PCA cannot capture NON-LINEAR structure!

Page 56:

PCA Conclusions

• PCA
  • finds an orthonormal basis for the data
  • sorts dimensions in order of "importance"
  • discards the low-significance dimensions
• Uses:
  • get a compact description
  • ignore noise
  • improve classification (hopefully)
• Not magic:
  • doesn't know class labels
  • can only capture linear variations
• One of many tricks to reduce dimensionality!

Page 57:

PCA Theory

Page 58:

Justification of Algorithm II

GOAL:

Page 59:

Justification of Algorithm II

x is centered!

Page 60:

Justification of Algorithm II

GOAL:

Use Lagrange-multipliers for the constraints.

Page 61:

Justification of Algorithm II
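The formulas on these justification slides were embedded as images and did not survive extraction; a standard reconstruction of the argument, using the Lagrange-multiplier step named above, is:

\max_{\|w\|=1} \frac{1}{m}\sum_{i=1}^{m}(w^\top x_i)^2 \;=\; \max_{\|w\|=1} w^\top \Sigma\, w, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top \quad (x_i\ \text{centered})

\mathcal{L}(w,\lambda) = w^\top \Sigma w - \lambda\,(w^\top w - 1), \qquad \nabla_w \mathcal{L} = 2\Sigma w - 2\lambda w = 0 \;\Rightarrow\; \Sigma w = \lambda w

So every stationary point is an eigenvector of Σ, and the attained variance is w^T Σ w = λ, which is largest for the eigenvector with the largest eigenvalue; repeating the argument in the residual subspace yields the remaining eigenvectors in decreasing order of λ. This is exactly the eigen-basis that Algorithm II computes.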

Page 62:

Kernel PCA

Page 63:

Kernel PCA: Performing PCA in the feature space

Lemma

Proof:

Page 64:

Kernel PCA

Lemma

Page 65:

Kernel PCA: Proof

Page 66:

Kernel PCA

• How do we use this to calculate the projection of a new sample t?

Where was I cheating?
The data should be centered in the feature space, too! But this is manageable...
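A compact kernel-PCA sketch along these lines, assuming an RBF kernel and including the feature-space centering just mentioned (the names and the choice of kernel are mine, not the slides'):

import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """X: m x d data matrix (rows are samples); RBF kernel exp(-gamma*||x-y||^2).
    Returns the m x k matrix of projections onto the top-k kernel components."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in the feature space
    eigvals, A = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    # scale the coefficient vectors so the feature-space directions have unit norm
    alphas = A[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return Kc @ alphas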

Page 67:

Input points before kernel PCA

http://en.wikipedia.org/wiki/Kernel_principal_component_analysis

Page 68:

Output after kernel PCA

The three groups are distinguishable using the first component only

Page 69:

We haven’t covered...

• Artificial Neural Network Implementations

• Mixture of Probabilistic PCA

• Online PCA, Regret Bounds

Page 70:

Thanks for your attention!