Multidimensional Data Analysis
The growth of large databases calls for substantial data processing.
Methods are needed to extract information from large data tables.
Three categories of data analysis methods:
Description: to describe a phenomenon without prior assumptions
Structuring: to synthesize information by partitioning the population into homogeneous groups
Explanation: to account for the observed values of one variable by means of those observed for other variables.
Multidimensional Data Analysis
One-dimensional descriptive statistics: summarize the information for each character (variable).
Data analysis: describes the relations between characters and their effects on the structure of the population.
Principal Component Analysis (PCA)
Factorial Correspondences Analysis (FCA)
Principal Component Analysis
PCA is used when we have a table of measurements. Here is an example of such a data file:
Columns: quantitative variables
Rows: observations
Observation  Height  Weight  Pulmonary capacity
Durand       1.77    72.4    2.69
Dupont       1.52    68.0    3.90
Dupond       1.64    68.0    3.40
Martin       1.76    50.0    2.00
Objectives of PCA:
Locate homogeneous groups of observations with respect to the set of variables.
Reduce a large number of variables systematically to a smaller, conceptually more coherent set of variables.
Build explanatory artificial statistical variables from the set of initial statistical variables.
The principal components are linear combinations of the original variables.
The goal is to reduce the dimensionality of the original data set.
A small set of uncorrelated variables is much easier to understand and use in further analyses than a large set of correlated variables.
Principal Component Analysis
3 types of PCA
General PCA: apply the PCA method to the initial data table.
Centered PCA: apply the PCA method to the centered variables.
Reduced PCA: apply the PCA method to the centered and reduced (standardized) variables.
Centered PCA
X: statistical variable
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \text{(mean of X)}$$
$$Y_i = X_i - \bar{X} \qquad \text{(centered variable)}$$
Reduced PCA
$$\sigma_X = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$
$$Z_i = \frac{X_i - \bar{X}}{\sigma_X} \qquad \text{(centered and reduced variable)}$$
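The centering and reduction steps can be sketched in a few lines of Python. This is only an illustration applied to the height column of the example table; the function names `center` and `standardize` are assumptions of this sketch, and the standard deviation uses the 1/n convention.

```python
import math

# Height column of the example table (Durand, Dupont, Dupond, Martin).
heights = [1.77, 1.52, 1.64, 1.76]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def center(values):
    """Centered variable: Y_i = X_i - mean(X)."""
    m = mean(values)
    return [x - m for x in values]

def standardize(values):
    """Centered and reduced variable: Z_i = (X_i - mean(X)) / sigma_X,
    where sigma_X is computed with the 1/n convention."""
    centered = center(values)
    sigma = math.sqrt(sum(y * y for y in centered) / len(values))
    return [y / sigma for y in centered]

y = center(heights)        # mean of y is 0
z = standardize(heights)   # mean of z is 0, variance of z is 1
```

After standardization every variable has mean 0 and variance 1, so variables measured in different units (height, weight, pulmonary capacity) carry the same weight in the analysis.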
PCA provides a way of representing a population in order to:
Locate homogeneous groups of observations with respect to the variables.
Reveal differences between observations or groups of observations with respect to the set of variables.
Highlight observations with atypical behavior.
Reduce the information needed to describe the position of an observation within the population.
Principal Component Analysis
Principle
The population
Variables defined on the population.
Example:
Two types of analysis
Analysis of the observations.
Analysis of the variables.
The reduced analysis:
Principal Component Analysis
Observations analysis
Each observation is represented by a point in a three-dimensional space.
How can we compute a distance between two observations?
The distance measures the resemblance between two observations. The smaller the distance, the closer the two points and the more the two observations resemble each other.
Conversely, the larger the distance, the farther apart the points and the less the observations resemble each other.
Principal Component Analysis
Observations analysis
The 3 axes are defined by the variables Y1(.), Y2(.) and Y3(.), calculated from the initial variables.
The distance between two observations i and k is given by:
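Assuming the usual Euclidean distance on the three axes, that is, d²(i, k) = (Y1(i) - Y1(k))² + (Y2(i) - Y2(k))² + (Y3(i) - Y3(k))², the computation can be sketched as follows. The function name and the two sample points are hypothetical.

```python
import math

def distance(obs_i, obs_k):
    """Euclidean distance between two observations, each given as a
    sequence of coordinates (Y1, Y2, Y3) of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(obs_i, obs_k)))

# Hypothetical coordinates of two observations on the three axes.
obs_i = (1.0, 0.0, -1.0)
obs_k = (0.0, 2.0, 1.0)

d = distance(obs_i, obs_k)
```

The same function works in any dimension, which matters once more than three variables are measured.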
It is impossible to draw the observations in a space of dimension greater than 3.
We must therefore find a good representation of the set of observations in a space of lower dimension (2, for example).
How do we pass from a space of dimension 3 or more to a space of smaller dimension?
Look for a "good subspace" of representation by using a mathematical operator.
Two problems arise:
1. Give a meaning to the expression "good representation"
2. Characterize the subspace
Principal Component Analysis
Observations analysis
Find a subspace F such that the distances between points are preserved as well as possible when the points are projected onto this subspace.
Thus, the resemblance between observations is preserved by the projection.
Find a subspace F such that:
Solution: the subspace F, of dimension q, is determined by the q largest eigenvalues of the matrix Y'Y and their associated eigenvectors.
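To make the eigenvalue step concrete, here is a minimal pure-Python sketch that extracts the q largest eigenvalues and eigenvectors of a symmetric positive semi-definite matrix by power iteration with deflation. This technique is chosen here only for illustration (the slides do not say which algorithm the program uses), and the 3 x 3 matrix is hypothetical, not the Y'Y of the example.

```python
import math

def mat_vec(a, v):
    """Product of a square matrix (list of rows) with a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def power_iteration(a, iters=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(a)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-symmetric start
    for _ in range(iters):
        w = mat_vec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v' A v (v has unit norm after the loop)
    lam = sum(x * y for x, y in zip(v, mat_vec(a, v)))
    return lam, v

def top_eigenpairs(a, q):
    """q largest eigenpairs, found one at a time with deflation."""
    a = [row[:] for row in a]  # work on a copy
    pairs = []
    for _ in range(q):
        lam, v = power_iteration(a)
        pairs.append((lam, v))
        # Deflation: subtract lam * v v' so the next pass finds the next axis.
        for i in range(len(a)):
            for j in range(len(a)):
                a[i][j] -= lam * v[i] * v[j]
    return pairs

# Hypothetical symmetric matrix with eigenvalues 3, 1 and 0.5.
matrix = [[2.0, 1.0, 0.0],
          [1.0, 2.0, 0.0],
          [0.0, 0.0, 0.5]]

pairs = top_eigenpairs(matrix, 2)
```

In practice a linear algebra library routine for symmetric matrices would normally be preferred; the sketch only shows what "the q first eigenvalues and associated eigenvectors" means computationally.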
$$\frac{1}{9}\,Y'Y \quad \text{with} \quad Y'Y = \begin{pmatrix} 9 & 34.2 & 94.8 \\ 34.2 & 9 & 275.1 \\ 94.8 & 275.1 & 9 \end{pmatrix}$$
(Correlation matrix)
Principal Component Analysis
Z = Y'Y
λ1, λ2, λ3, …, λm: eigenvalues of Z
u1, u2, u3, …, um: eigenvectors of Z
Y·u1: vector of the coordinates of the n observations on the first principal axis
Y·u2: vector of the coordinates of the n observations on the second principal axis
…
Y·um: vector of the coordinates of the n observations on the m-th principal axis
Observations analysis
Indicators are needed to assess the quality of the results obtained.
These indicators are:
an indicator of global quality
an indicator of the contribution of an observation to the total inertia
an indicator of the contribution of an observation to the inertia explained by the subspace F
an indicator of the error of perspective.
Principal Component Analysis
Observations analysis
Global quality
Eigenvalues of Y'Y:
λ1 = 0.689
λ2 = 0.310
λ3 = 0.00
One-dimensional subspace F: we obtain IQG(F) = 0.6896. The first axis of the analysis accounts for 68.96% of the initial information.
Subspace generated by the first two axes: IQG(F) = 1 (100% of the initial information).
q: dimension of the subspace
n: number of variables
The eigenvalues are numbered in descending order.
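The legend above (q, n, eigenvalues in descending order) suggests that IQG(F) is the ratio of the sum of the q largest eigenvalues to the sum of all n eigenvalues. Under that assumption, a minimal sketch with the rounded eigenvalues quoted in the example looks like this; the function name `global_quality` is illustrative.

```python
def global_quality(eigenvalues, q):
    """Assumed definition of IQG(F): share of the total inertia carried
    by the q largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:q]) / sum(ordered)

# Rounded eigenvalues quoted in the example.
eigenvalues = [0.689, 0.310, 0.00]

iqg_1 = global_quality(eigenvalues, 1)  # quality of the first axis alone
iqg_2 = global_quality(eigenvalues, 2)  # quality of the first factorial plane
```

Because the eigenvalues are rounded to three decimals, the value for q = 1 agrees with the quoted 0.6896 only approximately.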
Contribution of the observation to total inertia
$\sum_i \mathrm{CIT}(i) = 1$
CIT makes it easy to locate the observations that lie far from the center of gravity.
N: number of individuals (observations) in the PCA
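One common way to define this contribution, assumed here because the formula itself is not reproduced in the text, is CIT(i) = d²(i, G) / Σ_k d²(k, G), where G is the center of gravity and all N observations have the same weight. The sketch below uses that definition on four hypothetical points.

```python
def total_inertia_contributions(rows):
    """Assumed definition: CIT(i) = d^2(i, G) / sum_k d^2(k, G),
    with G the center of gravity and uniform weights."""
    n = len(rows)
    p = len(rows[0])
    # Center of gravity: coordinate-wise mean of the observations.
    g = [sum(r[j] for r in rows) / n for j in range(p)]
    # Squared distance of each observation to the center of gravity.
    sq = [sum((r[j] - g[j]) ** 2 for j in range(p)) for r in rows]
    total = sum(sq)
    return [s / total for s in sq]

# Hypothetical observations; the last one lies far from the others.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]

cit = total_inertia_contributions(rows)
```

The contributions sum to 1, and the outlying fourth point carries the largest share, which is how CIT flags observations far from the center of gravity.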
Contribution of the observation to the inertia explained by the subspace
The CIE identifies the observations that contribute most to creating the subspace F.
In general, this indicator is calculated for every observation and for each axis.
CIE values for the nine observations of our example.
Principal Component Analysis
Observations analysis
Error of perspective:
COS²(.,.) has the following properties:
The quality of representation of an observation on the subspace
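The squared cosine is usually taken as the ratio between the squared norm of the projection of an observation onto F and the squared norm of the observation itself, both measured from the center of gravity. A minimal sketch under that assumption, with F described by orthonormal axis vectors; the name `cos2` and the sample vectors are illustrative.

```python
def cos2(point, axes):
    """Squared cosine of the angle between a centered observation and
    the subspace spanned by orthonormal axes."""
    norm2 = sum(x * x for x in point)
    # Squared norm of the projection: sum of squared coordinates on each axis.
    proj2 = sum(sum(x * u for x, u in zip(point, axis)) ** 2 for axis in axes)
    return proj2 / norm2

# Hypothetical subspace F: the plane of the first two coordinate axes.
axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

well_represented = cos2((2.0, 1.0, 0.0), axes)    # lies in F
badly_represented = cos2((0.0, 0.0, 3.0), axes)   # orthogonal to F
in_between = cos2((1.0, 1.0, 1.0), axes)
```

A value close to 1 means the projection hardly distorts the observation; a value close to 0 means that its apparent position in the plane is mostly an effect of perspective.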
Principal Component Analysis
Variables analysis
Objective: to determine synthetic statistical variables that "explain" the initial variables.
Problem: to fix the criterion used to determine these synthetic variables, then to interpret them.
In our example, the problem can be posed mathematically as follows:
Y1(.), Y2(.) and Y3(.) are explained linearly by the synthetic variables Z1(.) and Z2(.)
d1(.), d2(.) and d3(.) are the residual variables, whose variances we want to minimize
The a_ij are the solutions of the optimization problem:
$$\min \big( V(d_1(.)) + V(d_2(.)) + V(d_3(.)) \big)$$
V(d_i(.)): variance of d_i(.)
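Reading these statements together, the underlying model is presumably the following linear decomposition; the exact indexing of the coefficients a_ij is an assumption of this sketch.

```latex
\begin{aligned}
Y_1(.) &= a_{11}\,Z_1(.) + a_{12}\,Z_2(.) + d_1(.) \\
Y_2(.) &= a_{21}\,Z_1(.) + a_{22}\,Z_2(.) + d_2(.) \\
Y_3(.) &= a_{31}\,Z_1(.) + a_{32}\,Z_2(.) + d_3(.)
\end{aligned}
```

Minimizing the sum of the residual variances then amounts to asking that the two synthetic variables reproduce the three initial variables as faithfully as possible.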
Solution: compute the eigenvectors associated with the q largest eigenvalues of the matrix YY'.
Note: the matrix YY' has the same nonzero eigenvalues as the matrix Y'Y.
These two eigenvectors define the two synthetic variables sought.
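The note on YY' and Y'Y follows from a short standard argument, sketched here for a nonzero eigenvalue λ of Y'Y with eigenvector u:

```latex
\begin{aligned}
Y'Y\,u &= \lambda\,u, \qquad \lambda \neq 0,\ u \neq 0 \\
\Rightarrow\quad (YY')(Yu) &= Y\,(Y'Y\,u) = \lambda\,(Yu) \\
\|Yu\|^2 &= u'\,Y'Y\,u = \lambda\,\|u\|^2 > 0 \quad\Rightarrow\quad Yu \neq 0
\end{aligned}
```

So λ is also an eigenvalue of YY', with eigenvector Yu. Exchanging the roles of Y and Y' gives the converse, which is why the observations analysis and the variables analysis share the same nonzero eigenvalues.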
Principal Component Analysis
Variables analysis
The same indicators as before are used in the variables analysis.
An important indicator is IQG(F), the quality indicator of the subspace F (onto which the variables are projected).
This indicator gives the "residual variance" (the variance not accounted for by the representation in the subspace):
Residual variance = m · [1 - IQG(F)]
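As a small illustration of this formula, the sketch below computes the residual variance from a list of eigenvalues, assuming that IQG(F) is the share of the q largest eigenvalues in the total and that m is the number of variables. The eigenvalues are hypothetical.

```python
def residual_variance(eigenvalues, q):
    """Residual variance = m * (1 - IQG(F)), where m is the number of
    variables and IQG(F) the share of the q largest eigenvalues."""
    m = len(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    iqg = sum(ordered[:q]) / sum(ordered)
    return m * (1.0 - iqg)

# Hypothetical eigenvalues of a 3-variable reduced PCA (they sum to m = 3).
eigenvalues = [2.0, 0.75, 0.25]

res_1 = residual_variance(eigenvalues, 1)
res_2 = residual_variance(eigenvalues, 2)
```

In a reduced PCA the eigenvalues of the correlation matrix sum to m, so the residual variance is simply the sum of the eigenvalues that were left out.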
Principal Component Analysis
Variables analysis
It can be shown that the coordinate of the projection of a variable on an axis of the subspace is proportional to the linear correlation coefficient between this variable and the "synthetic" variable corresponding to the axis.
Note: given this proportionality, the program applies a rescaling so that the coordinates of the projected variables on each axis are directly the linear correlation coefficients.
Principal Component Analysis
Variables analysis
For each variable, the multiple correlation coefficient with the variables corresponding to the axes of the subspace F onto which it is projected is proportional to the squared norm of the projected vector.
A variable is better explained by the axes of a subspace when the norm of its projected vector is large.
Principal Component Analysis
Variables analysis
Simulated example
Number of variables: 8
Number of observations: 300
Num X1 X2 X3 X4 X5 X6 X7 X8
1 1.692 3.046 -7.4612 -2.0368 2.512 2.9584 0.8168 -1.2608
2 18.316 11.358 -25.7476 -8.6864 1.952 2.5664 0.0328 -0.7568
3 16.377 10.3885 -23.6147 -7.9108 1.82 2.474 -0.152 -0.638
4 6.688 5.544 -12.9568 -4.0352 1.918 2.5426 -0.0148 -0.7262
5 -2.666 0.867 -2.6674 -0.2936 2.291 2.8037 0.5074 -1.0619
6 7.103 5.7515 -13.4133 -4.2012 1.118 1.9826 -1.1348 -0.0062
7 12.558 8.479 -19.4138 -6.3832 1.664 2.3648 -0.3704 -0.4976
8 9.064 6.732 -15.5704 -4.9856 2.316 2.8212 0.5424 -1.0844
9 10.668 7.534 -17.3348 -5.6272 1.436 2.2052 -0.6896 -0.2924
10 7.136 5.768 -13.4496 -4.2144 2.514 2.9598 0.8196 -1.2626
Principal Component Analysis
Linear correlation between the variables
Variables not correlated with X1
Eigenvalues
Axis  Eigenvalue  Cumulated  Cumulated percentage
1     4.09407     4.09407    0.51176
2     3.90593     8.00000    1.00000
3     0.00000     8.00000    1.00000
4     0.00000     8.00000    1.00000
5     0.00000     8.00000    1.00000
6     0.00000     8.00000    1.00000
7     0.00000     8.00000    1.00000
8     0.00000     8.00000    1.00000
100% of the inertia is obtained with the first two axes.
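The cumulated columns of this eigenvalue table can be recomputed directly from the eigenvalues. A minimal sketch, with variable names chosen for illustration:

```python
from itertools import accumulate

# Eigenvalues of the simulated example (8 variables).
eigenvalues = [4.09407, 3.90593, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Running sum of the eigenvalues, then its share of the total inertia.
cumulated = list(accumulate(eigenvalues))
total = cumulated[-1]
percentages = [c / total for c in cumulated]
```

The total equals the number of variables (8), as expected for a reduced PCA, and the cumulated share reaches 1 from the second axis onward.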
Variables coordinates
Variable  U1      U2
X1         0.715   0.699
X2         0.715   0.699
X3        -0.715  -0.699
X4        -0.715  -0.699
X5        -0.715   0.699
X6        -0.715   0.699
X7        -0.715   0.699
X8         0.715  -0.699
[Figure: variables X1 to X8 plotted on axes #1 and #2, each ranging from -1 to 1. U1: first principal component.]
All the variables are located inside a unit circle (reduced PCA).
Two dimensions are highlighted.
[Figure: projection of the observations on axis 1 (horizontal, -0.4 to 0.4) and axis 2 (vertical, -0.3 to 0.3).]
Observations coordinates
Factorial Correspondences Analysis
Factorial correspondences analysis is used to extract information from contingency tables.
Contingency tables (frequency tables): cross-tabulation of 2 variables X and Y.
X: m modalities
Y: p modalities
Objectives of FCA
To build a map of the modalities of the two variables X and Y.
To determine whether there are associations between certain modalities of X and certain modalities of Y.
Example: 2 variables, ward and expenditure.
5 wards (hospital departments)
5 expenditures (expenditure items)
Factorial Correspondences Analysis
Analysis of row modalities
A row modality is represented by a point in a p-dimensional space.
(27 18 12 19 8) represents the second row.
Row 2: a point of R^5
Row modalities: 5 points in a 5-dimensional space
Factorial Correspondences Analysis
How can we find a subspace of reduced dimension q (q = 2, for example) in which to represent these points?
The distances between the represented points (in the subspace) must be as close as possible to the distances between the initial points.
A distance between the points (between modalities) must therefore be defined.
A row modality is represented by a vector X_i whose coordinates are computed by:
$$X_{ij} = \frac{f_{ij}}{f_{i.}\,\sqrt{f_{.j}}}$$
Factorial Correspondences Analysis
The distance between two modalities is given by:
This distance is called the chi-square distance.
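The chi-square distance between two row modalities i and k is classically defined as d²(i, k) = Σ_j (1 / f.j) (f_ij / f_i. - f_kj / f_k.)², that is, a Euclidean distance between row profiles in which each column is weighted by the inverse of its marginal frequency. A minimal sketch under that standard definition; the contingency table is hypothetical and is not the ward and expenditure table of the example.

```python
def chi2_distance_sq(table, i, k):
    """Squared chi-square distance between rows i and k of a
    contingency table given as a list of rows of counts."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Marginal column frequencies f.j
    col_freq = [sum(row[j] for row in table) / total for j in range(n_cols)]
    # Row totals; count / row total equals the row profile f_ij / f_i.
    row_i = sum(table[i])
    row_k = sum(table[k])
    return sum(
        (table[i][j] / row_i - table[k][j] / row_k) ** 2 / col_freq[j]
        for j in range(n_cols)
    )

# Hypothetical 3 x 3 contingency table.
table = [[10, 20, 30],
         [20, 40, 60],
         [30, 20, 10]]

d2_same_profile = chi2_distance_sq(table, 0, 1)  # proportional rows
d2_different = chi2_distance_sq(table, 0, 2)
```

Rows 0 and 1 have the same profile, so their distance is zero even though their totals differ: the chi-square distance compares the shapes of the rows, not their sizes.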
Example: the distances between the ward modalities are given in this table:
Factorial Correspondences Analysis
The problem formulation
Find a q-dimensional subspace F such that:
is maximized
Factorial Correspondences Analysis
Center of gravity of the x_i, each having a weight f_{i.}
Centering operation
Each vector z_i has p coordinates, denoted z_ij.
We can define a matrix Z whose general term is z_ij.
It can be shown that the q-dimensional subspace F is generated by the eigenvectors of the matrix Z'Z.
Factorial Correspondences Analysis
Example: center of gravity
Vector x_i
Vector y_i
Matrix Z
Eigenvalues:
λ1 = 0.01
λ2 = 0.00176
λ3 = 0
Factorial Correspondences Analysis
Quality of representation indicators
Quality of the subspace generated:
q: dimension of the subspace
p: number of column modalities
Factorial Correspondences Analysis
Contribution of a row modality i to the construction of axis k:
0 ≤ CIE(i, u_k) ≤ 1
If CIE is close to 1, the row modality has a significant weight in the determination of the subspace F.
Example: contribution of the row modalities
Factorial Correspondences Analysis
Quality of representation (perspective effect):
Measures the degree of deformation during projection.
Factorial Correspondences Analysis
Column modalities analysis:
Column modalities are analyzed in the same manner as the row modalities.
The coordinates of x_i are such that:
The matrices Z'Z and ZZ' have the same nonzero eigenvalues.
Factorial Correspondences Analysis
Contributions of the column modalities
Quality of representation of the column modalities
These indicators have the same definitions, adapted to the column modalities.
Factorial Correspondences Analysis
Simultaneous representation of the rows and the columns, projected onto the first factorial plane (axes 1 and 2) of our example
Factorial Correspondences Analysis
Illustrations