stat 315c: transposable data correspondence analysis
TRANSCRIPT
Stat 315c: Transposable DataCorrespondence Analysis
Art B. Owen
Stanford Statistics
Art B. Owen (Stanford Statistics) Correspondence Analysis 1 / 17
Correspondence Analysis
It plots both variables and cases in the same plane.
Clearest motivation is for contingency table data. It gets usedelsewhere too.
Emphasis is on presenting the data themselves as opposed toilluminating an underlying model.
This is an old and classical statistical technique pioneered byJean-Paul Benzecri in the 1960s.
The treatment by Greenacre is particularly clear.
Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17
Correspondence Analysis
It plots both variables and cases in the same plane.
Clearest motivation is for contingency table data. It gets usedelsewhere too.
Emphasis is on presenting the data themselves as opposed toilluminating an underlying model.
This is an old and classical statistical technique pioneered byJean-Paul Benzecri in the 1960s.
The treatment by Greenacre is particularly clear.
Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17
Correspondence Analysis
It plots both variables and cases in the same plane.
Clearest motivation is for contingency table data. It gets usedelsewhere too.
Emphasis is on presenting the data themselves as opposed toilluminating an underlying model.
This is an old and classical statistical technique pioneered byJean-Paul Benzecri in the 1960s.
The treatment by Greenacre is particularly clear.
Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17
Correspondence Analysis
It plots both variables and cases in the same plane.
Clearest motivation is for contingency table data. It gets usedelsewhere too.
Emphasis is on presenting the data themselves as opposed toilluminating an underlying model.
This is an old and classical statistical technique pioneered byJean-Paul Benzecri in the 1960s.
The treatment by Greenacre is particularly clear.
Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17
Correspondence Analysis
It plots both variables and cases in the same plane.
Clearest motivation is for contingency table data. It gets usedelsewhere too.
Emphasis is on presenting the data themselves as opposed toilluminating an underlying model.
This is an old and classical statistical technique pioneered byJean-Paul Benzecri in the 1960s.
The treatment by Greenacre is particularly clear.
Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17
Correspondence Analysis
It plots both variables and cases in the same plane.
Clearest motivation is for contingency table data. It gets usedelsewhere too.
Emphasis is on presenting the data themselves as opposed toilluminating an underlying model.
This is an old and classical statistical technique pioneered byJean-Paul Benzecri in the 1960s.
The treatment by Greenacre is particularly clear.
Art B. Owen (Stanford Statistics) Correspondence Analysis 2 / 17
Contingency tables
I Ć J table of counts
n11 n12 Ā· Ā· Ā· n1J
n21 n22 Ā· Ā· Ā· n2J...
.... . .
...nI1 nI2 Ā· Ā· Ā· nIJ
Nomenclature
Correspondence matrix P pij = nij/nā¢ā¢
Row masses ri = piā¢ = niā¢/nā¢ā¢
Column masses cj = piā¢ = niā¢/nā¢ā¢
Row profiles ri = (pi1/ri, . . . , piJ/ri)ā² ā RJ
Column profiles cj = (p1j/cj , . . . , pIj/cj)ā² ā RI
These are conditional and marginal distributions
Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17
Contingency tables
I Ć J table of counts
n11 n12 Ā· Ā· Ā· n1J
n21 n22 Ā· Ā· Ā· n2J...
.... . .
...nI1 nI2 Ā· Ā· Ā· nIJ
Nomenclature
Correspondence matrix P pij = nij/nā¢ā¢
Row masses ri = piā¢ = niā¢/nā¢ā¢
Column masses cj = piā¢ = niā¢/nā¢ā¢
Row profiles ri = (pi1/ri, . . . , piJ/ri)ā² ā RJ
Column profiles cj = (p1j/cj , . . . , pIj/cj)ā² ā RI
These are conditional and marginal distributions
Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17
Contingency tables
I Ć J table of counts
n11 n12 Ā· Ā· Ā· n1J
n21 n22 Ā· Ā· Ā· n2J...
.... . .
...nI1 nI2 Ā· Ā· Ā· nIJ
Nomenclature
Correspondence matrix P pij = nij/nā¢ā¢
Row masses ri = piā¢ = niā¢/nā¢ā¢
Column masses cj = piā¢ = niā¢/nā¢ā¢
Row profiles ri = (pi1/ri, . . . , piJ/ri)ā² ā RJ
Column profiles cj = (p1j/cj , . . . , pIj/cj)ā² ā RI
These are conditional and marginal distributions
Art B. Owen (Stanford Statistics) Correspondence Analysis 3 / 17
First moments: centroids
Row centroid
Iāi=1
riri =Iā
i=1
ri
(pi1
ri, . . . ,
piJ
ri
)ā²= (c1, . . . , cJ)ā² ā” c
Column centroid
Jāj=1
cjcj = (r1, . . . , rI)ā² ā” r
Upshot
āMassā weighted average of row profiles is marginal distribution overcolumns
Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17
First moments: centroids
Row centroid
Iāi=1
riri =Iā
i=1
ri
(pi1
ri, . . . ,
piJ
ri
)ā²= (c1, . . . , cJ)ā² ā” c
Column centroid
Jāj=1
cjcj = (r1, . . . , rI)ā² ā” r
Upshot
āMassā weighted average of row profiles is marginal distribution overcolumns
Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17
First moments: centroids
Row centroid
Iāi=1
riri =Iā
i=1
ri
(pi1
ri, . . . ,
piJ
ri
)ā²= (c1, . . . , cJ)ā² ā” c
Column centroid
Jāj=1
cjcj = (r1, . . . , rI)ā² ā” r
Upshot
āMassā weighted average of row profiles is marginal distribution overcolumns
Art B. Owen (Stanford Statistics) Correspondence Analysis 4 / 17
Second moments: inertias
Chisquare for independence as weighted Euclidean distance
X2 =ā
i
āj
(nij ā niā¢nā¢j/nā¢ā¢)2
niā¢nā¢j/nā¢ā¢
=ā
i
niā¢
āj
(nij/niā¢ ā nā¢j/nā¢ā¢)2
nā¢j/nā¢ā¢
= nā¢ā¢
āi
niā¢
nā¢ā¢
āj
(nij/niā¢ ā nā¢j/nā¢ā¢)2
nā¢j/nā¢ā¢
= nā¢ā¢
āi
ri(ri ā c)ā²diag(c)ā1(ri ā c)
= nā¢ā¢ Ć Inertia
This is the total inertia of the row profiles. It equals total inertia ofcolumn profiles.
Art B. Owen (Stanford Statistics) Correspondence Analysis 5 / 17
Second moments: inertias
Chisquare for independence as weighted Euclidean distance
X2 =ā
i
āj
(nij ā niā¢nā¢j/nā¢ā¢)2
niā¢nā¢j/nā¢ā¢
=ā
i
niā¢
āj
(nij/niā¢ ā nā¢j/nā¢ā¢)2
nā¢j/nā¢ā¢
= nā¢ā¢
āi
niā¢
nā¢ā¢
āj
(nij/niā¢ ā nā¢j/nā¢ā¢)2
nā¢j/nā¢ā¢
= nā¢ā¢
āi
ri(ri ā c)ā²diag(c)ā1(ri ā c)
= nā¢ā¢ Ć Inertia
This is the total inertia of the row profiles. It equals total inertia ofcolumn profiles.
Art B. Owen (Stanford Statistics) Correspondence Analysis 5 / 17
Geometry
For J = 3
12 8 87 7 68 8 106 9 89 8 7We can plot profiles in R3
Low inertia
Art B. Owen (Stanford Statistics) Correspondence Analysis 6 / 17
Geometry
This example has higher inertia
Art B. Owen (Stanford Statistics) Correspondence Analysis 7 / 17
Geometry
Still higher inertia.
Ļ2 statistics describes variationof row profiles
Similarly for col profiles
Art B. Owen (Stanford Statistics) Correspondence Analysis 8 / 17
Rescale
Distances
Euclidean distance in plot ignores column values
Replace ri by ri with rij =rijācj
Euclidean dist between ri and riā² is āĻ2 distā between ri and riā² .
Art B. Owen (Stanford Statistics) Correspondence Analysis 9 / 17
Rescale
Distances
Euclidean distance in plot ignores column values
Replace ri by ri with rij =rijācj
Euclidean dist between ri and riā² is āĻ2 distā between ri and riā² .
Art B. Owen (Stanford Statistics) Correspondence Analysis 9 / 17
Reason for Ļ2
Invariance
Suppose rows i and iā² are proportional
nij/niā²j = Ī± all j = 1, . . . , J
Suppose also that we pool these rows
New niāj = nij + niā²j
and delete originals
Then
New Ļ2 distance between cols j and jā² equals old dist
Principle of distributional equivalence
Common profile, summed mass
Role of Ļ2 in statistical significance is not considered important in thisliterature
Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17
Reason for Ļ2
Invariance
Suppose rows i and iā² are proportional
nij/niā²j = Ī± all j = 1, . . . , J
Suppose also that we pool these rows
New niāj = nij + niā²j
and delete originals
Then
New Ļ2 distance between cols j and jā² equals old dist
Principle of distributional equivalence
Common profile, summed mass
Role of Ļ2 in statistical significance is not considered important in thisliterature
Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17
Reason for Ļ2
Invariance
Suppose rows i and iā² are proportional
nij/niā²j = Ī± all j = 1, . . . , J
Suppose also that we pool these rows
New niāj = nij + niā²j
and delete originals
Then
New Ļ2 distance between cols j and jā² equals old dist
Principle of distributional equivalence
Common profile, summed mass
Role of Ļ2 in statistical significance is not considered important in thisliterature
Art B. Owen (Stanford Statistics) Correspondence Analysis 10 / 17
Dimension reduction
Now we have a plot
With rows and cols both in RJā1
If J is too big
reduce dimension
by principal components of
rij ā cjācj
plot in reduced dimension
along with images of corners
Art B. Owen (Stanford Statistics) Correspondence Analysis 11 / 17
Dimension reduction
Now we have a plot
With rows and cols both in RJā1
If J is too big
reduce dimension
by principal components of
rij ā cjācj
plot in reduced dimension
along with images of corners
Art B. Owen (Stanford Statistics) Correspondence Analysis 11 / 17
Duality
Rows lie in min(I ā 1, J ā 1) dimensional space
So do columns
In PC of row profiles . . . columns are outside
In PC of column profiles . . . rows are outside
Symmetric correspondence analysis overlap the points after rescaling
More notation
Dr = diag(r) = diag(r1, . . . , rI)
Dc = diag(c) = diag(c1, . . . , cJ)
Art B. Owen (Stanford Statistics) Correspondence Analysis 12 / 17
Duality
Rows lie in min(I ā 1, J ā 1) dimensional space
So do columns
In PC of row profiles . . . columns are outside
In PC of column profiles . . . rows are outside
Symmetric correspondence analysis overlap the points after rescaling
More notation
Dr = diag(r) = diag(r1, . . . , rI)
Dc = diag(c) = diag(c1, . . . , cJ)
Art B. Owen (Stanford Statistics) Correspondence Analysis 12 / 17
Symmetric analysis
Uses SVD S = UĪ£V ā² where
sij =pij ā ricjā
ricj
Total inertia is āSā2F
āprincipal inertiasā are Ī»2j
Coordinates
Rows (1st k cols of)
F = (Dā1r P ā 1cā²)Dā1
c V1:k = Dā1r UĪ£
Columns (1st k cols of)
G = (Dā1c P ā 1rā²)Dā1
r U1:k = Dā1c V Ī£ā²
Art B. Owen (Stanford Statistics) Correspondence Analysis 13 / 17
Symmetric analysis
Uses SVD S = UĪ£V ā² where
sij =pij ā ricjā
ricj
Total inertia is āSā2F
āprincipal inertiasā are Ī»2j
Coordinates
Rows (1st k cols of)
F = (Dā1r P ā 1cā²)Dā1
c V1:k = Dā1r UĪ£
Columns (1st k cols of)
G = (Dā1c P ā 1rā²)Dā1
r U1:k = Dā1c V Ī£ā²
Art B. Owen (Stanford Statistics) Correspondence Analysis 13 / 17
Symmetric analysis
Interpretation is tricky/controversial
ri near riā²ā
cj near cjā²ā
ri near cj ??Rows and columns are not in the same space
Biplots
Due to Gabriel (1971) Biometrika
For matrix Xij
plot rows as ui ā R2
cols as vj ā R2
with uā²ivj.= Xij
A biplot interpretation applies to asymmetric plots
Art B. Owen (Stanford Statistics) Correspondence Analysis 14 / 17
Some finer points
Ghost points
Apply projection to point not in table
E.G. hypothetical row entity,1 impute a presidentās āsenate voting recordā2 compare a stateās economy to those of countries
Treat as fixed profile with mass ā 0
Merged points
Add linear combination or sum of rows, E.G.1 pool columns for math and statistics into āmath sciencesā2 pool rows for EU countries into an EU point
Art B. Owen (Stanford Statistics) Correspondence Analysis 15 / 17
Some finer points
Ghost points
Apply projection to point not in table
E.G. hypothetical row entity,1 impute a presidentās āsenate voting recordā2 compare a stateās economy to those of countries
Treat as fixed profile with mass ā 0
Merged points
Add linear combination or sum of rows, E.G.1 pool columns for math and statistics into āmath sciencesā2 pool rows for EU countries into an EU point
Art B. Owen (Stanford Statistics) Correspondence Analysis 15 / 17
Data types
Counts are straightforward
Other ānear measuresā are reasonableI rainfalls, heights, volumes, temperatures KelvinI dollars spentI parts per million
Reweight cols to equalize inertia ā standardizing to equalize varianceRequires iteration
Fuzzy coding
x ā R becomes two columns
(1, 0) for small x say x < L
(0, 1) for large x say x > U
(1ā t, t) for intermediate x t = (xā L)/(U ā L)
Generalizations to > 2 columns
Art B. Owen (Stanford Statistics) Correspondence Analysis 16 / 17
Data types
Counts are straightforward
Other ānear measuresā are reasonableI rainfalls, heights, volumes, temperatures KelvinI dollars spentI parts per million
Reweight cols to equalize inertia ā standardizing to equalize varianceRequires iteration
Fuzzy coding
x ā R becomes two columns
(1, 0) for small x say x < L
(0, 1) for large x say x > U
(1ā t, t) for intermediate x t = (xā L)/(U ā L)
Generalizations to > 2 columns
Art B. Owen (Stanford Statistics) Correspondence Analysis 16 / 17
Puzzlers
Does it scale? (eg 108 points in the plane)
Is there a tensor version? (Beyond all pairs of two way versions)
Distributional equivalence vs Poisson models
Further reading
āCorrespondence Analysis in Practiceā M.J. Greenacre, 1993Emphasizes geometry with examples
āTheory and Applications of Correspondence Analysisā M.J.Greenacre, 1984Good coverage of theory with examples
āCorrespondence Analysis and Data Coding with Java and Rā F.Murtagh, 2005Code and worked examples
Art B. Owen (Stanford Statistics) Correspondence Analysis 17 / 17
Puzzlers
Does it scale? (eg 108 points in the plane)
Is there a tensor version? (Beyond all pairs of two way versions)
Distributional equivalence vs Poisson models
Further reading
āCorrespondence Analysis in Practiceā M.J. Greenacre, 1993Emphasizes geometry with examples
āTheory and Applications of Correspondence Analysisā M.J.Greenacre, 1984Good coverage of theory with examples
āCorrespondence Analysis and Data Coding with Java and Rā F.Murtagh, 2005Code and worked examples
Art B. Owen (Stanford Statistics) Correspondence Analysis 17 / 17