How Principal Components Analysis is Different from Factor Analysis
DESCRIPTION
One of the oldest sources of confusion. The title tells it all: how is principal components analysis different from factor analysis?
TRANSCRIPT
How Principal Components Analysis is different from Factor Analysis and yet ends up with very similar results
10 April 2023
©Arup Guha - Indian Institute of Foreign Trade - New Delhi, India
Contents:
Background and Intuition
Principal Components Analysis
Factor Analysis
Comparison between PCA and Factor Analysis
Cases and choice between PCA and Factor Analysis
A typical conversation
Analyst 1: I'm confused, should I run PCA or factor analysis?
Analyst 2: Depends. If you are doing variable reduction or developing a ranking, PCA is better. If you are proposing a model for the observed variables, then factor analysis.
Analyst 1: So there is a difference between the two?
Analyst 2: Yep.
Analyst 1: But both give very close communalities.
Analyst 2: Yep, but not always.
Analyst 1: Can you tell me the difference between the two?
Analyst 2: Yep.
Analyst 1: In non-mathematical terms?
Analyst 2: Nope. PCA is maths and factor analysis is stats. There is no layman's analogue of eigenvectors and eigenvalues that I know of.
Analyst 1: But if in most cases they are similar, should I bother?
Analyst 2: If you are trained in maths or stats, yes, or you wouldn't be able to sleep at night. If you are trained in market research, then no. The serious answer is: it depends on the data.
Some background: Let's recall the variance-covariance matrix
Let Y be an n×1 vector of random variables, and let μi = E(Yi), i = 1, 2, …, n. Then
var(Yi) = E[(Yi − μi)(Yi − μi)] … (1)
cov(Yi, Yj) = E[(Yi − μi)(Yj − μj)], i ≠ j … (2)
Then E[(Y − μ)(Y − μ)’] gives the variance-covariance matrix, whose diagonal elements are (1) and whose off-diagonal elements are (2).
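To make definitions (1) and (2) concrete, here is a minimal numpy sketch; the toy data and variable names are my own, not part of the original slides:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(500, 3))                 # 500 draws of a 3-variable random vector Y

mu = Y.mean(axis=0)                           # sample estimates of mu_i = E(Y_i)
centered = Y - mu                             # (Y_i - mu_i) for each observation
Sigma = centered.T @ centered / (len(Y) - 1)  # sample variance-covariance matrix

# Diagonal elements are the variances (1); off-diagonals are the covariances (2)
assert np.allclose(Sigma, np.cov(Y, rowvar=False))
```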
Some Intuition: What really is happening in PCA
Think of the problem you solve every time you take a photograph: converting a 3D object into a 2D photograph with maximum detail retained.
If we take a high-dimensional data vector in n-space and project it onto a lower-dimensional subspace of n − k (k > 0) dimensions, such that the retained variance is maximised, we get the principal components.
Note that there is no model involved here; we just want to capture the maximum information in the photograph.
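To make the photograph analogy concrete, here is a hedged numpy sketch (the toy point cloud is my own choice) that projects 3-dimensional points onto the 2-dimensional plane retaining the most variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# A toy "3D object": a point cloud stretched mostly along two directions
X = rng.normal(size=(1000, 3)) @ np.diag([5.0, 2.0, 0.3])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending eigenvalues
W = eigvecs[:, ::-1][:, :2]       # top-2 eigenvectors: the "camera plane"

photo = Xc @ W                    # the 2D "photograph" of the 3D cloud
retained = eigvals[::-1][:2].sum() / eigvals.sum()
print(f"variance retained: {retained:.1%}")   # close to 100% for this cloud
```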
Principal Components Analysis
Suppose that x is a vector of p random variables, and that the variances of the p random variables and the structure of the covariances or correlations between the p variables are of interest
Say we are lazy and simply don't want to look at the p variances and all of the ½p(p − 1) correlations or covariances
An alternative approach is to look for a few (<< p) derived variables that preserve most of the information given by these variances and correlations or covariances
What are Principal Components
Although PCA does not ignore covariances and correlations, it concentrates on variances
We go about finding these PCs so that the minimum number of PCs explains the maximum variance
The first step is to look for a linear function α’1x of the elements of x having maximum variance, where α1 is a vector of p constants α11, α12, . . . , α1p
Next, look for a linear function α’2x, uncorrelated with α’1x having maximum variance, and so on
These are the Principal Components
How to find them
Consider, for the moment, the case where the vector of random variables x has a known covariance matrix Σ
This is the famous variance-covariance matrix, whose (i, j)th element is the (known) covariance between the ith and jth elements of x when i ≠ j, and the variance of the jth element of x when i = j
Now two very important results:
1. It turns out that for k = 1, 2, …, p, the kth PC is given by zk = α’kx, where αk is an eigenvector of Σ corresponding to its kth largest eigenvalue λk
2. Furthermore, if αk is chosen to have unit length (α’kαk = 1), then var(zk) = λk, where var(zk) denotes the variance of zk
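Both results are easy to check numerically. A small sketch on simulated data (the covariance matrix below is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma_true = np.array([[4.0, 2.0, 0.0],
                       [2.0, 3.0, 1.0],
                       [0.0, 1.0, 2.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma_true, size=100_000)

Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs                          # z_k = alpha_k' x  (result 1)
print(np.var(Z, axis=0, ddof=1))         # matches eigvals: var(z_k) = lambda_k (result 2)
print(eigvals)
```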
Normalization
To derive the form of the PCs, consider first α’1x; the vector α1 maximizes var[α’1x] = α’1Σα1
It is clear that the maximum will not be achieved for any finite α1, so a normalization constraint must be imposed
The constraint used in the derivation is α’1α1 = 1, that is, the sum of squares of elements of α1 equals 1
Maximization
To maximize α’1Σα1 subject to α’1α1 = 1, the standard approach is to use the technique of Lagrange multipliers
Maximise α’1Σα1 − λ(α’1α1 − 1), where λ is the Lagrange multiplier
Differentiation with respect to α1 gives
Σα1 − λα1 = 0 … (A)
or (Σ − λIp)α1 = 0, where Ip is the (p × p) identity matrix
Thus, λ is an eigenvalue of Σ and α1 is the corresponding eigenvector (this follows from the spectral decomposition of a symmetric matrix)
Does it maximise?
Now α1 is supposed to maximise the variance α’1Σα1
For that to happen, Σα1 − λα1 = 0 must hold, so α’1Σα1 = α’1λα1 = λα’1α1 = λ
So the variance, when maximised, equals λ
This implies that if we select the largest eigenvalue λ1 and the eigenvector α1 associated with it, we maximise the retained variance, and α’1x is the first PC
The second PC
The second PC, α’2x, maximizes α’2Σα2 subject to being uncorrelated with α’1x
Or equivalently subject to cov[α’1x,α’2x] = 0, where cov(x, y) denotes the covariance between the random variables x and y
Solving, we once again come down to maximising λ, but it cannot equal the largest eigenvalue, since that one is already taken by the first PC. So λ = λ2, the second largest eigenvalue of Σ
And so on
So what have we got?
It can be shown that for the first, second, third, …, pth PCs, the vectors of coefficients α1, α2, α3, …, αp are the eigenvectors of Σ corresponding to λ1, λ2, λ3, …, λp, the largest through the smallest eigenvalue, respectively
Also, var[α’kx] = λk for k = 1, 2, . . . , p.
So how does this compare with Factor Analysis
Principal component analysis has often been dealt with in textbooks as a special case of factor analysis, and this practice is continued by some widely used computer packages, which treat PCA as one option in a program for factor analysis
This view is misguided since PCA and factor analysis, as usually defined, are really quite distinct techniques
Factor Analysis
So what is the Factor Analysis model
The basic idea underlying factor analysis is that p observed random variables, x, can be expressed, except for an error term, as linear functions of m (< p) hypothetical (random) variables or common factors
That is, if x1, x2, …, xp are the variables and f1, f2, …, fm are the factors, then
x1 = λ11f1 + λ12f2 + … + λ1mfm + e1
x2 = λ21f1 + λ22f2 + … + λ2mfm + e2
…
xp = λp1f1 + λp2f2 + … + λpmfm + ep
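The model is straightforward to simulate. A minimal sketch (the dimensions and loading values below are arbitrary choices of mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, n = 6, 2, 10_000                        # p observed variables, m common factors

Lambda = rng.uniform(0.4, 0.9, size=(p, m))   # factor loadings lambda_jk
psi = rng.uniform(0.1, 0.4, size=p)           # specific variances, one per x_j

f = rng.normal(size=(n, m))                   # common factors f_1..f_m
e = rng.normal(size=(n, p)) * np.sqrt(psi)    # specific factors e_1..e_p

X = f @ Lambda.T + e                          # x_j = sum_k lambda_jk f_k + e_j
```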
Terminologies
The λ’s are the factor loadings
The e’s are the error terms, sometimes also called specific factors, because ej is specific to xj, unlike the fk, which are common to several xj
The fk are the factors common to several x’s
We will skip the additional details of the factor analysis model, since the objective is to demonstrate the difference between factor analysis and PCA, not to explain the former
Factor Analysis model continued
x = Λf + e (the factor analysis model in matrix form)
Now, going back to the two analysts we met at the start of this presentation:
Analyst 2: While factor analysis and PCA are both dimension-reduction techniques, factor analysis attempts the reduction by proposing a model relating the observed variables to the latent variables. PCA has no such underlying model.
In other words, the cameraman is just trying to take the best 2D representation of the 3D world (PCA). He is not trying to fit a model to explain the world.
Estimation of the Factor analysis model parameters
Analyst 1: But since both are trying to do the same thing, what if PCA is used to solve the factor analysis model? Then there would be no difference between PCA and factor analysis, right?
Analyst 2: Very good point. But PCA explains all the variance and covariance of the variance-covariance matrix for the given data, whereas factor analysis explains only the common variance. Let's get back to the models.
Comparison between PCA and Factor Analysis
Let us get back to PCs
As derived earlier, α’kΣαk is maximised by setting α’kΣαk = λk
This maximises var(zk), where zk = α’kx, but in doing so it accounts for the variances along the diagonal of Σ as well as the off-diagonal covariances or correlations in it
So the PCs explain the diagonal elements (the variances) as well as the off-diagonal elements (the covariances/correlations) of the variance-covariance matrix of the original data x
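One concrete way to see this: the full set of PCs reconstructs Σ exactly, diagonal and off-diagonal alike. A short sketch (the Σ below is an arbitrary example of mine):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
# Spectral decomposition: Sigma = sum_k lambda_k * alpha_k alpha_k'
Sigma_rebuilt = sum(lam * np.outer(a, a) for lam, a in zip(eigvals, eigvecs.T))

assert np.allclose(Sigma, Sigma_rebuilt)   # the PCs account for ALL of Sigma
```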
What does Factor Analysis do?
If you remember the factor analysis model in matrix form: x = Λf + e
Along with the following assumptions:
E[ee’] = Ψ (a diagonal matrix)
E[fe’] = 0 (a matrix of zeros)
E[ff’] = Im (the identity matrix)
the above model implies that the variance-covariance matrix has the form:
Σ = ΛΛ’ + Ψ
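Continuing in the same simulation spirit (again with arbitrary toy parameters of mine), the sample covariance of x does converge to ΛΛ’ + Ψ:

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, n = 6, 2, 200_000

Lambda = rng.uniform(0.4, 0.9, size=(p, m))     # loadings
psi = rng.uniform(0.1, 0.4, size=p)             # specific variances (diagonal of Psi)

f = rng.normal(size=(n, m))                     # E[ff'] = I_m
e = rng.normal(size=(n, p)) * np.sqrt(psi)      # E[ee'] = Psi, independent of f
X = f @ Lambda.T + e

Sigma_model = Lambda @ Lambda.T + np.diag(psi)  # implied covariance: Lambda Lambda' + Psi
Sigma_sample = np.cov(X, rowvar=False)
print(np.abs(Sigma_model - Sigma_sample).max()) # small, and shrinking as n grows
```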
What does the factor analysis variance-covariance matrix imply
Σ = ΛΛ’ + Ψ
Now, Ψ is a diagonal matrix, which means that its off-diagonal terms are zero
So the contribution of Ψ to the off-diagonal terms of Σ is nil
Note that the relative contributions of ΛΛ’ and Ψ to the diagonal terms of Σ depend on the nature of the variable xj in question
If xj is highly correlated with all the other variables, then its communality will be large and its specific variance (the variance of ej) will be low
On the other hand, if xj is almost uncorrelated with the other variables, then its communality will be low and its specific variance will be large
So where does that leave us I
We have a data vector x
This data vector has a variance-covariance matrix Σ, whose diagonal terms are the variances and whose off-diagonal terms are the covariances
If the objective is dimension reduction, or to get a ranking variable, or something like image recognition, then we would do PCA
PCA takes care of the diagonal as well as the off-diagonal elements of the matrix Σ
So where does that leave us II
Now say we come up with a factor model that explains the data x
Using this model we can decompose the variance-covariance matrix: Σ = ΛΛ’ + Ψ
Now, as soon as we move to a factor model, our objective changes from retaining the maximum variance (the photography example) to uncovering the common latent factors driving the data (e.g. psychology driving behaviour)
So where does that leave us III
That is, in factor analysis we are interested only in the ΛΛ’ part of Σ; in PCA we are interested in the entire Σ
To understand this properly, let us consider the following cases:
1. The variables in vector x are all correlated
2. The variables in vector x are uncorrelated
3. Some of the variables are correlated and some are not
Cases and choice between PCA and Factor Analysis
Case I: The variables in vector x are all correlated
If we specify a factor analysis model, then the elements of Ψ will be small and ΛΛ’ will dominate Σ
In other words, the diagonal and off-diagonal elements of Σ are all dominated by common variation
So a direct PCA to extract the factors would mostly extract common variation
Since this is the objective of factor analysis as well, in this case PCA and factor analysis give very close results
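A hedged comparison on toy data, using scikit-learn's PCA and FactorAnalysis purely as convenient implementations (the data-generating choices below are mine):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(5)
g = rng.normal(size=(5_000, 1))                  # one strong common factor
X = g @ np.full((1, 5), 0.9) + 0.3 * rng.normal(size=(5_000, 5))
# All five variables are highly correlated (Case I)

pca_dir = PCA(n_components=1).fit(X).components_[0]
fa_dir = FactorAnalysis(n_components=1).fit(X).components_[0]

# Up to sign and scale, the two directions nearly coincide in Case I
print(pca_dir / np.linalg.norm(pca_dir))
print(fa_dir / np.linalg.norm(fa_dir))
```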
Case I: The variables in vector x are all correlated (contd)
To see this intuitively, let us consider principal factor analysis, an alternative method of extracting factors
Since in factor analysis we are interested in the common variation only, that is, Σ − Ψ = ΛΛ’, principal factor analysis applies PCA to Σ − Ψ rather than to the entire Σ
However, since all the variables are correlated, Ψ has small elements, so the difference between the factors extracted by applying PCA to Σ − Ψ versus Σ is likely to be minimal
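A naive principal-factor sketch, with communalities initialised from squared multiple correlations (this is my own bare-bones illustration; real implementations iterate the communality estimates and handle degenerate cases):

```python
import numpy as np

def principal_factor(R, m):
    """PCA on the reduced matrix R - Psi rather than on R itself."""
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))  # squared multiple correlations
    R_reduced = R.copy()
    np.fill_diagonal(R_reduced, smc)             # replace 1s with communality estimates
    eigvals, eigvecs = np.linalg.eigh(R_reduced)
    top = np.argsort(eigvals)[::-1][:m]          # keep the m largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
```

On Case I data, the loadings from principal_factor(R, m) and from plain PCA on R should come out nearly identical, which is exactly the point of this slide.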
Case II: The variables in vector x are uncorrelated
Again, if we consider the factor analysis model, the diagonal elements of Σ would be dominated by specific variance from Ψ, and the off-diagonal elements would be very small
Here, an application of PCA to Σ would extract PCs that capture only specific variance, not common variation, so they would be very different from the factors
Factors drawn through principal factor analysis would differ from the components, but the model would not hold, since there is no correlation to explain
In this case, factor analysis does not make sense, which should be clear from the correlation matrix itself
Case III: Some of the variables are correlated and some are not
Consider two variables, xi and xj
xi is highly correlated with the rest of the variables. Then in Σ, both the diagonal and the off-diagonal elements involving xi are dominated by common variation
xj, on the other hand, is not correlated with the rest of the variables. In Σ, the diagonal element for xj is dominated by specific variation from Ψ
Applying PCA to Σ would once again pick up both common and specific variation, making the components different from the factors
It is better to apply principal factor analysis, since it strips Σ of the specific variance due to Ψ
Choice
Now that we know where the choice between PCA and factor analysis is trivial in practice (they always differ in theory) and where it is not, how do we choose?
The answer is the most basic step of statistics: know your data. Or, more specifically, know the correlation structure of your data (see the sketch below).
However, since this is difficult and a judgement call, it is always advisable to use non-PCA extraction techniques for factor analysis, lest you come up with factors that also contain specific variation.
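In practice, "know the correlation structure" can start as simply as the sketch below (the function name and the threshold are my own arbitrary choices):

```python
import numpy as np

def correlation_summary(X, threshold=0.3):
    """Quick look at how correlated the columns of X are before choosing PCA vs FA."""
    R = np.corrcoef(X, rowvar=False)
    off_diag = np.abs(R[~np.eye(len(R), dtype=bool)])   # off-diagonal correlations
    print(f"mean |r| off-diagonal: {off_diag.mean():.2f}")
    print(f"share of |r| > {threshold}: {(off_diag > threshold).mean():.0%}")
    return R
```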