
Principal Component Analysis

A Brief Introduction

Mingfu Liu, PhD, DCS

Methodology Journal Club, University of Calgary
April 19, 2011


How Old is Principal Component Analysis

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559-572.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-441.


Principal Components Visual Presentation


How do you describe him?

1. How many attributes can you get from him?
2. How do you describe him generally?


Now we have p variables from a single population of size n:

X11, X12, X13, …, X1p
X21, X22, X23, …, X2p
…
Xn1, Xn2, Xn3, …, Xnp

Which variables should be used to represent the characteristics of the population?

How do we classify these variables, independent or dependent?


Which variables should be used to represent the characteristics of the population?

The simplest way is to keep one variable and discard all others: not reasonable!

Weight all variables equally: not reasonable (even if they have the same variance).

Weighted average based on some criterion.

Which criterion?


The weighted average f(X1, X2, X3, … Xp) seems reasonable.

If this is true, the Xs are independent variables. Where are the dependent variables? In this case, the dependent variables are unobservable latent variables (we assume they are dependent variables for now).

We need to set up a criterion to find the function -> Principal Component Analysis


Principal Component Analysis

Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.

Objectives of principal component analysis

To discover or to reduce the dimensionality of the data set.

To identify new meaningful underlying variables.


Transformation

• Look for a transformation of the original data vector $X$ (p×1) so that the new variables, the principal components ($Y_i$), can be defined as

$$
\begin{aligned}
Y_1 &= \alpha_1^T X = \alpha_{11} X_1 + \alpha_{12} X_2 + \dots + \alpha_{1p} X_p \\
&\;\;\vdots \\
Y_i &= \alpha_i^T X = \alpha_{i1} X_1 + \alpha_{i2} X_2 + \dots + \alpha_{ip} X_p \\
&\;\;\vdots \\
Y_p &= \alpha_p^T X = \alpha_{p1} X_1 + \alpha_{p2} X_2 + \dots + \alpha_{pp} X_p
\end{aligned}
$$

• where $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ip})^T$ is a column vector of weights with

$$
\alpha_i^T \alpha_i = \alpha_{i1}^2 + \alpha_{i2}^2 + \dots + \alpha_{ip}^2 = 1
$$

which restricts the vector to unit length and eliminates the indeterminacy in scale.


Variance Maximization

Maximize the variance of the projection of the observations on the principal component ($Y_i$) to find a vector $\alpha_i$ for each principal component ($Y_i$).

$$
\begin{aligned}
\mathrm{Var}(Y_1) &= \mathrm{Var}(\alpha_1^T X) = \alpha_1^T \,\mathrm{Var}(X)\, \alpha_1 \quad \text{is maximal} \\
&\;\;\vdots \\
\mathrm{Var}(Y_i) &= \mathrm{Var}(\alpha_i^T X) = \alpha_i^T \,\mathrm{Var}(X)\, \alpha_i \quad \text{is maximal} \\
&\;\;\vdots \\
\mathrm{Var}(Y_p) &= \mathrm{Var}(\alpha_p^T X) = \alpha_p^T \,\mathrm{Var}(X)\, \alpha_p \quad \text{is maximal}
\end{aligned}
$$

For $i > 1$, the maximum is taken over unit vectors $\alpha_i$ that are orthogonal to $\alpha_1, \dots, \alpha_{i-1}$, so that $Y_i$ is uncorrelated with the earlier components.

The matrix $C = \mathrm{Var}(X)$ is the variance matrix of the X variables.

To maximize a function of several variables subject to one or more constraints, the method of Lagrange multipliers is used.

In this case this leads to the solution that $\alpha_i$ is the $i$th eigenvector of the variance matrix $C$.
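The Lagrange multiplier step can be written out for the first component as a short sketch. It assumes only the unit-length constraint $\alpha_1^T \alpha_1 = 1$ and uses $\lambda$ as the multiplier; the later components follow the same pattern with added orthogonality constraints.

$$
\begin{aligned}
L(\alpha_1, \lambda) &= \alpha_1^T C \alpha_1 - \lambda\,(\alpha_1^T \alpha_1 - 1) \\
\frac{\partial L}{\partial \alpha_1} &= 2 C \alpha_1 - 2 \lambda \alpha_1 = 0
\;\;\Longrightarrow\;\; C \alpha_1 = \lambda \alpha_1 \\
\mathrm{Var}(Y_1) &= \alpha_1^T C \alpha_1 = \lambda\, \alpha_1^T \alpha_1 = \lambda
\end{aligned}
$$

So every stationary point is an eigenvector of $C$, the variance attained there is its eigenvalue, and the maximum is reached at the eigenvector with the largest eigenvalue.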


Variance Matrix

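$$
C = \begin{pmatrix}
v(x_1) & c(x_1, x_2) & \cdots & c(x_1, x_p) \\
c(x_2, x_1) & v(x_2) & \cdots & c(x_2, x_p) \\
\vdots & \vdots & \ddots & \vdots \\
c(x_p, x_1) & c(x_p, x_2) & \cdots & v(x_p)
\end{pmatrix}
$$

$C$ has $p$ eigenvalue-eigenvector (latent value - latent vector) pairs $(\lambda_1, \alpha_1), (\lambda_2, \alpha_2), \dots, (\lambda_p, \alpha_p)$, corresponding to the variances and coefficient vectors of the $p$ principal components, where $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \dots \ge \lambda_p$.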
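A minimal sketch of this step in Python with NumPy, assuming a small made-up data matrix `X` (the values are illustrative only and are not from the talk). It estimates the variance matrix and extracts the eigenvalue-eigenvector pairs in decreasing order of eigenvalue.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations (rows) on p = 3 variables (columns).
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.6],
    [2.3, 2.7, 1.0],
])

# Sample variance matrix C (p x p); rowvar=False treats columns as variables.
C = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Reorder so that lambda_1 >= lambda_2 >= ... >= lambda_p.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i holds the weight vector alpha_i

print("eigenvalues:", eigenvalues)
print("first weight vector:", eigenvectors[:, 0])
```

The sign of each eigenvector is arbitrary, so different software may report a weight vector multiplied by -1.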


Principal Component Variance - Eigenvalue

A principal component has a variance equal to the corresponding eigenvalue:

Var(Yi) = λi for all i = 1…p

A small λi means small variance: the data change little in the direction of the component Yi

Principal components Yi are derived in decreasing order of importance: λ1 >= λ2 >= λ3 >= … >= λp

The relative variance explained by each principal component is given by λi / Σ λi
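The two claims above (component variance equals the eigenvalue, and relative variance explained is λi / Σ λi) can be checked numerically. This is a sketch on simulated data; the mixing matrix and the seed are arbitrary assumptions chosen only to produce correlated variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations on 3 correlated variables.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing

C = np.cov(X, rowvar=False)
lam, A = np.linalg.eigh(C)
lam, A = lam[::-1], A[:, ::-1]        # decreasing order of eigenvalue

# Scores on the principal components (centering does not change the variances).
Y = (X - X.mean(axis=0)) @ A
var_Y = Y.var(axis=0, ddof=1)         # sample variance of each component

explained = lam / lam.sum()           # relative variance explained
cumulative = np.cumsum(explained)     # cumulative variance explained

print("Var(Y_i):   ", var_Y)
print("eigenvalues:", lam)
print("explained:  ", explained)
```

The total variance is preserved by the transformation: the eigenvalues sum to the trace of C.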


Principal Component Weights - Eigenvector

The principal component $Y_i$, a linear combination of the original variables ($X$), is calculated using the eigenvector $\alpha_i$ as weights:

$$
\begin{aligned}
Y_1 &= \alpha_1^T X = \alpha_{11} X_1 + \alpha_{12} X_2 + \dots + \alpha_{1p} X_p \\
Y_2 &= \alpha_2^T X = \alpha_{21} X_1 + \alpha_{22} X_2 + \dots + \alpha_{2p} X_p \\
&\;\;\vdots \\
Y_p &= \alpha_p^T X = \alpha_{p1} X_1 + \alpha_{p2} X_2 + \dots + \alpha_{pp} X_p
\end{aligned}
$$

or

$$
Y_i = \alpha_i^T X = \alpha_{i1} X_1 + \alpha_{i2} X_2 + \dots + \alpha_{ip} X_p, \quad i = 1, \dots, p
$$

As the eigenvectors are orthogonal to one another, the principal components are orthogonal (uncorrelated) to one another.
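As an illustration of the weights and the orthogonality claim, the sketch below builds the scores from the eigenvectors and checks that the weight vectors are orthonormal and that the resulting components have zero covariance. The data values are made up for the example.

```python
import numpy as np

# Hypothetical data: 6 observations on 3 variables.
X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63],
    [4.4, 2.3, 0.61],
])

C = np.cov(X, rowvar=False)
lam, A = np.linalg.eigh(C)
lam, A = lam[::-1], A[:, ::-1]   # column i of A is the weight vector alpha_i

# Y_i = alpha_i^T X for every observation at once: each row of Y is one observation.
Y = X @ A

# The same formula for a single observation x (a p-vector).
x = X[0]
y_single = A.T @ x

gram = A.T @ A                   # should be the identity (orthonormal weights)
cov_Y = np.cov(Y, rowvar=False)  # should be diagonal (uncorrelated components)
off_diagonal = cov_Y - np.diag(np.diag(cov_Y))
```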


Principal Components Visual Presentation


Pearson’s Visual Presentation


Key Points

What we need to remember for now:

1. We transform the original variables Xs to principal components Ys using eigenanalysis

2. The eigenvalues are the corresponding variances of the principal components, in decreasing order of importance

3. The eigenvectors are the corresponding weight sets for the principal components, which are linear combinations of the original variables

4. Principal components are uncorrelated


Eigenanalysis: Square Matrix Decomposition and Diagonalization

The eigenvalues $\lambda_i$ are found by solving the equation

$$
\det(C - \lambda I) = 0
$$

Eigenvectors are the columns of the matrix $A$ such that

$$
C = A D A^T
$$

where

$$
D = \begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_p
\end{pmatrix}
$$
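A small numerical check of both statements, assuming a made-up 2×2 symmetric matrix. For a 2×2 matrix the characteristic equation det(C - λI) = 0 reduces to λ² - trace(C)·λ + det(C) = 0, so its roots can be compared directly with the eigenvalues returned by NumPy.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a variance matrix.
C = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# Characteristic polynomial of a 2x2 matrix: lambda^2 - trace(C) lambda + det(C).
char_poly = [1.0, -np.trace(C), np.linalg.det(C)]
roots = np.sort(np.roots(char_poly))[::-1]

# Eigen-decomposition, reordered to decreasing eigenvalues.
lam, A = np.linalg.eigh(C)
lam, A = lam[::-1], A[:, ::-1]

D = np.diag(lam)
C_rebuilt = A @ D @ A.T      # decomposition: C = A D A^T
D_from_C = A.T @ C @ A       # diagonalization: A^T C A = D
```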


Application Strategy

Keep enough principal components that the cumulative variance they explain exceeds 50-70%

Kaiser criterion: keep principal components with eigenvalues > 1 (this applies when the correlation matrix is used, so that each standardized variable has variance 1)

Scree plot: represents the ability of principal components to explain the variation in data
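The first two rules can be sketched as two small helper functions. The function names, the default 70% threshold, and the eigenvalues are assumptions made for this illustration; the eigenvalues are chosen to sum to 5, as they would for five standardized variables.

```python
def n_components_by_cumulative_variance(eigenvalues, threshold=0.70):
    """Smallest k such that the k largest eigenvalues explain at least
    `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= threshold:
            return k
    return len(ordered)


def n_components_by_kaiser(eigenvalues):
    """Number of eigenvalues greater than 1 (meant for correlation-matrix PCA)."""
    return sum(1 for lam in eigenvalues if lam > 1.0)


# Hypothetical eigenvalues of a 5 x 5 correlation matrix.
eigenvalues = [2.8, 1.2, 0.6, 0.3, 0.1]

k_cumulative = n_components_by_cumulative_variance(eigenvalues, threshold=0.70)
k_kaiser = n_components_by_kaiser(eigenvalues)
print(k_cumulative, k_kaiser)  # 2 2
```

Here the first component explains 56% and the first two explain 80%, so both rules keep two components; the rules need not agree in general.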


Scree Plot


Standardization of Original Variables

If the variables have very heterogeneous variances, we standardize them. The standardized variables Zi are

Zi = (Xi - mean) / √variance

The new variables all have unit variance (= 1)
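A minimal sketch of the standardization, assuming two made-up variables measured on very different scales. The sample standard deviation (`ddof=1`) is used so that the result has sample variance exactly 1.

```python
import numpy as np

# Hypothetical data: the second variable has a much larger variance than the first.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 200.0],
    [4.0, 500.0],
    [5.0, 400.0],
])

mean = X.mean(axis=0)
sd = X.std(axis=0, ddof=1)   # square root of the sample variance

Z = (X - mean) / sd          # Z_i = (X_i - mean) / sqrt(variance)

print("variances before:", X.var(axis=0, ddof=1))
print("variances after: ", Z.var(axis=0, ddof=1))
```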


Correlation Matrix

When the original variables are standardized, the covariance matrix becomes the correlation matrix.

We apply the same methods to do Principal Component Analysis. The only difference is that the C matrix is replaced by the R matrix.

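$$
R = \begin{pmatrix}
1 & r(x_1, x_2) & \cdots & r(x_1, x_p) \\
r(x_2, x_1) & 1 & \cdots & r(x_2, x_p) \\
\vdots & \vdots & \ddots & \vdots \\
r(x_p, x_1) & r(x_p, x_2) & \cdots & 1
\end{pmatrix}
$$

where $r(x_i, x_j)$ is the correlation between $x_i$ and $x_j$.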
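The relationship between the two matrices can be confirmed on a small example. This sketch assumes made-up data; it shows that the covariance matrix of the standardized variables equals the correlation matrix of the original variables, and that the eigenvalues of R sum to p because every diagonal entry is 1.

```python
import numpy as np

# Hypothetical data: 6 observations on 3 variables with different scales.
X = np.array([
    [170.0, 65.0, 30.0],
    [160.0, 58.0, 41.0],
    [182.0, 80.0, 25.0],
    [175.0, 72.0, 52.0],
    [168.0, 60.0, 37.0],
    [190.0, 95.0, 46.0],
])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

C_of_Z = np.cov(Z, rowvar=False)      # covariance matrix of standardized variables
R = np.corrcoef(X, rowvar=False)      # correlation matrix of original variables

lam_R = np.linalg.eigvalsh(R)[::-1]   # eigenvalues of R in decreasing order
p = X.shape[1]
```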


Covariance Matrix or Correlation Matrix?

The covariance matrix is used when

1. The original variables have the same unit/scale

2. The original variables have similar variances

Otherwise, the correlation matrix should be used


Simple Example

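$$
C = \begin{pmatrix} 1 & 4 \\ 4 & 100 \end{pmatrix}
$$

$$
\begin{aligned}
\lambda_1 &= 100.16, & \alpha_1^T &= [0.04,\ 0.999], & Y_1 &= 0.04\,X_1 + 0.999\,X_2 \\
\lambda_2 &= 0.84, & \alpha_2^T &= [0.999,\ -0.04], & Y_2 &= 0.999\,X_1 - 0.04\,X_2
\end{aligned}
$$

$$
R = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 1 \end{pmatrix}
$$

$$
\begin{aligned}
\lambda_1 &= 1.4, & \alpha_1^T &= [0.707,\ 0.707], & Y_1 &= 0.707\,Z_1 + 0.707\,Z_2 = 0.707\,(X_1 - M_1) + 0.0707\,(X_2 - M_2) \\
\lambda_2 &= 0.6, & \alpha_2^T &= [0.707,\ -0.707], & Y_2 &= 0.707\,Z_1 - 0.707\,Z_2 = 0.707\,(X_1 - M_1) - 0.0707\,(X_2 - M_2)
\end{aligned}
$$

Here $M_1$ and $M_2$ are the means of $X_1$ and $X_2$. The coefficient 0.0707 arises because $\mathrm{Var}(X_2) = 100$, so $Z_2 = (X_2 - M_2)/10$.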
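The numbers in this example can be reproduced with a few lines of NumPy. The helper `sorted_eigh` is a name introduced here for illustration; it sorts the eigenvalues in decreasing order and fixes the (arbitrary) sign of each eigenvector so that its first entry is positive.

```python
import numpy as np


def sorted_eigh(M):
    """Eigenvalues in decreasing order, eigenvectors as columns with a
    positive first entry (the sign of an eigenvector is arbitrary)."""
    lam, vec = np.linalg.eigh(M)
    lam, vec = lam[::-1], vec[:, ::-1]
    vec = vec * np.sign(vec[0, :])
    return lam, vec


C = np.array([[1.0, 4.0],
              [4.0, 100.0]])

# Correlation matrix implied by C: divide each covariance by the two standard deviations.
sd = np.sqrt(np.diag(C))
R = C / np.outer(sd, sd)

lam_C, A_C = sorted_eigh(C)
lam_R, A_R = sorted_eigh(R)

# Weights of the first correlation-based component on the original (centered) scale.
weights_original_scale = A_R[:, 0] / sd

print("C eigenvalues:", lam_C.round(2), "first weights:", A_C[:, 0].round(3))
print("R eigenvalues:", lam_R.round(2), "first weights:", A_R[:, 0].round(3))
```

With the covariance matrix, the first component is almost entirely X2, the variable with the large variance; with the correlation matrix, both variables receive equal weight.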


Something to think about

Are original variables really independent variables?

It depends on:

1. How to approach the problem
2. What methods to use to solve the problem


How to Approach the Problem

1. When we treat the original variables as analytical input, as we have done, they are independent variables

2. When we treat the original variables as realizations of the underlying latent variables, they are dependent variables


What methods to use to solve the problem

Three Methods are Available:

1. Maximizing Variance
2. Minimizing Error
3. Diagonalizing the Correlation Matrix

When the minimizing error method is used, the original variables are dependent variables


What methods to use to solve the problem

X = YB + E

where

X is an n x p matrix of the centered observed variables;
Y is the n x j matrix of scores on the first j principal components;
B is the j x p matrix of eigenvectors;
E is an n x p matrix of residuals.

The method is to minimize the sum of all the squared elements in E.

In this case the original variables are dependent variables, just as in SEM and latent variable models.
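The error-minimizing view can be illustrated with a sketch on simulated data (the dimensions, seed, and mixing are arbitrary assumptions). It builds Y and B from the first j eigenvectors, forms the residual E = X - YB, and compares the residual sum of squares with (n - 1) times the sum of the discarded eigenvalues, which is what the algebra predicts for centered data. For contrast it also uses the last j eigenvectors, which give a larger error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, j = 50, 4, 2

# Hypothetical correlated data, then centered so that X matches the model above.
raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = raw - raw.mean(axis=0)

C = np.cov(X, rowvar=False)
lam, A = np.linalg.eigh(C)
lam, A = lam[::-1], A[:, ::-1]          # decreasing order of eigenvalue

B = A[:, :j].T                          # j x p matrix of the first j eigenvectors
Y = X @ B.T                             # n x j matrix of scores
E = X - Y @ B                           # n x p matrix of residuals
sse = np.sum(E ** 2)

# Same construction with the j *last* eigenvectors, for comparison.
B_last = A[:, -j:].T
E_last = X - (X @ B_last.T) @ B_last
sse_last = np.sum(E_last ** 2)

print("SSE with first j components:", sse)
print("(n - 1) * discarded eigenvalues:", (n - 1) * lam[j:].sum())
```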


Key Points To Take Home

1. We transform the original variables Xs to principal components Ys using eigenanalysis

2. The eigenvalues are the corresponding variances of the principal components, in decreasing order of importance

3. The eigenvectors are the corresponding weight sets for the principal components, which are linear combinations of the original variables

4. Principal components are uncorrelated


Principal Component Analysis

Questions?


SAS Example Demo
