Stat 315c: Transposable Data

Correspondence Analysis

Art B. Owen

Stanford Statistics

Art B. Owen (Stanford Statistics) Correspondence Analysis 1 / 17

Correspondence Analysis

It plots both variables and cases in the same plane.

Clearest motivation is for contingency table data. It gets used elsewhere too.

Emphasis is on presenting the data themselves as opposed to illuminating an underlying model.

This is an old and classical statistical technique pioneered by Jean-Paul Benzécri in the 1960s.

The treatment by Greenacre is particularly clear.

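Contingency tables

$I \times J$ table of counts

$$
\begin{pmatrix}
n_{11} & n_{12} & \cdots & n_{1J} \\
n_{21} & n_{22} & \cdots & n_{2J} \\
\vdots & \vdots & \ddots & \vdots \\
n_{I1} & n_{I2} & \cdots & n_{IJ}
\end{pmatrix}
$$

Nomenclature

Correspondence matrix $P$: $p_{ij} = n_{ij}/n_{\bullet\bullet}$

Row masses: $r_i = p_{i\bullet} = n_{i\bullet}/n_{\bullet\bullet}$

Column masses: $c_j = p_{\bullet j} = n_{\bullet j}/n_{\bullet\bullet}$

Row profiles: $\mathbf{r}_i = (p_{i1}/r_i, \dots, p_{iJ}/r_i)' \in \mathbb{R}^J$

Column profiles: $\mathbf{c}_j = (p_{1j}/c_j, \dots, p_{Ij}/c_j)' \in \mathbb{R}^I$

The profiles are conditional distributions and the masses are marginal distributions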
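The nomenclature above can be sketched in a few lines of NumPy. This is only an illustration: the counts are the same numbers as the J = 3 example in these slides, and the variable names (`row_profiles`, `col_profiles`) are choices made here, not notation from the slides.

```python
import numpy as np

# Illustrative 5 x 3 table of counts (same numbers as the J = 3 example).
N = np.array([[12, 8, 8],
              [7, 7, 6],
              [8, 8, 10],
              [6, 9, 8],
              [9, 8, 7]], dtype=float)

P = N / N.sum()                      # correspondence matrix, p_ij = n_ij / n..
r = P.sum(axis=1)                    # row masses r_i = p_i.
c = P.sum(axis=0)                    # column masses c_j = p_.j
row_profiles = P / r[:, None]        # row i holds (p_i1/r_i, ..., p_iJ/r_i)
col_profiles = (P / c[None, :]).T    # row j holds (p_1j/c_j, ..., p_Ij/c_j)

# Each profile is a conditional distribution, so it sums to one.
print(row_profiles.sum(axis=1))
print(col_profiles.sum(axis=1))
```

Storing the column profiles as rows of a J x I array keeps the two cases symmetric, which makes later formulas easier to transpose.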

First moments: centroids

Row centroid

$$
\sum_{i=1}^{I} r_i \mathbf{r}_i
= \sum_{i=1}^{I} r_i \left(\frac{p_{i1}}{r_i}, \dots, \frac{p_{iJ}}{r_i}\right)'
= (c_1, \dots, c_J)' \equiv \mathbf{c}
$$

Column centroid

$$
\sum_{j=1}^{J} c_j \mathbf{c}_j = (r_1, \dots, r_I)' \equiv \mathbf{r}
$$

Upshot

'Mass' weighted average of row profiles is marginal distribution over columns
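A quick numerical check of the two centroid identities, as a minimal sketch. The table is the J = 3 example from these slides; the names `row_centroid` and `col_centroid` are introduced here for illustration.

```python
import numpy as np

N = np.array([[12, 8, 8],
              [7, 7, 6],
              [8, 8, 10],
              [6, 9, 8],
              [9, 8, 7]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
row_profiles = P / r[:, None]
col_profiles = (P / c[None, :]).T

row_centroid = r @ row_profiles      # sum_i r_i * (row profile i)
col_centroid = c @ col_profiles      # sum_j c_j * (column profile j)

print(np.allclose(row_centroid, c))  # True: row centroid is the column masses
print(np.allclose(col_centroid, r))  # True: column centroid is the row masses
```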

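Second moments: inertias

Chi-square for independence as weighted Euclidean distance

$$
\begin{aligned}
X^2 &= \sum_i \sum_j \frac{(n_{ij} - n_{i\bullet} n_{\bullet j}/n_{\bullet\bullet})^2}{n_{i\bullet} n_{\bullet j}/n_{\bullet\bullet}} \\
&= \sum_i n_{i\bullet} \sum_j \frac{(n_{ij}/n_{i\bullet} - n_{\bullet j}/n_{\bullet\bullet})^2}{n_{\bullet j}/n_{\bullet\bullet}} \\
&= n_{\bullet\bullet} \sum_i \frac{n_{i\bullet}}{n_{\bullet\bullet}} \sum_j \frac{(n_{ij}/n_{i\bullet} - n_{\bullet j}/n_{\bullet\bullet})^2}{n_{\bullet j}/n_{\bullet\bullet}} \\
&= n_{\bullet\bullet} \sum_i r_i (\mathbf{r}_i - \mathbf{c})' \,\mathrm{diag}(\mathbf{c})^{-1} (\mathbf{r}_i - \mathbf{c}) \\
&= n_{\bullet\bullet} \times \text{Inertia}
\end{aligned}
$$

This is the total inertia of the row profiles. It equals total inertia of column profiles.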
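The chain of equalities can be checked numerically. The sketch below computes the Pearson statistic from expected counts, then the total inertia from row profiles and from column profiles; the table is the J = 3 example from these slides and all variable names are illustrative.

```python
import numpy as np

N = np.array([[12, 8, 8],
              [7, 7, 6],
              [8, 8, 10],
              [6, 9, 8],
              [9, 8, 7]], dtype=float)

n = N.sum()
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)

# Pearson chi-square from expected counts under independence.
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
X2 = ((N - E) ** 2 / E).sum()

# Total inertia of the row profiles: sum_i r_i (r_i - c)' diag(c)^{-1} (r_i - c).
R = P / r[:, None]
row_inertia = (r * ((R - c) ** 2 / c).sum(axis=1)).sum()

# Same quantity from the column profiles, with the roles of r and c swapped.
C = (P / c[None, :]).T
col_inertia = (c * ((C - r) ** 2 / r).sum(axis=1)).sum()

print(X2, n * row_inertia, n * col_inertia)   # three equal numbers
```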

Geometry

For $J = 3$

$$
\begin{pmatrix}
12 & 8 & 8 \\
7 & 7 & 6 \\
8 & 8 & 10 \\
6 & 9 & 8 \\
9 & 8 & 7
\end{pmatrix}
$$

We can plot profiles in $\mathbb{R}^3$

Low inertia

Geometry

This example has higher inertia

Geometry

Still higher inertia.

$\chi^2$ statistic describes variation of row profiles

Similarly for column profiles

Rescale

Distances

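Euclidean distance in plot ignores column values

Replace $\mathbf{r}_i$ by $\tilde{\mathbf{r}}_i$ with $\tilde r_{ij} = r_{ij}/\sqrt{c_j}$, where $r_{ij} = p_{ij}/r_i$ is component $j$ of row profile $\mathbf{r}_i$

Euclidean dist between $\tilde{\mathbf{r}}_i$ and $\tilde{\mathbf{r}}_{i'}$ is the "$\chi^2$ dist" between $\mathbf{r}_i$ and $\mathbf{r}_{i'}$.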
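As a small sketch of the rescaling step: dividing component j of each row profile by the square root of the column mass turns the chi-square distance into an ordinary Euclidean distance. The helper `chi2_dist` is a name made up for this example, and the table is the J = 3 example from these slides.

```python
import numpy as np

N = np.array([[12, 8, 8],
              [7, 7, 6],
              [8, 8, 10],
              [6, 9, 8],
              [9, 8, 7]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
R = P / r[:, None]            # row profiles
R_tilde = R / np.sqrt(c)      # rescaled profiles, component j divided by sqrt(c_j)

def chi2_dist(a, b, masses):
    """Chi-square distance between profiles a and b, weighted by 1/masses."""
    return np.sqrt(((a - b) ** 2 / masses).sum())

d_chi2 = chi2_dist(R[0], R[1], c)
d_euclid = np.linalg.norm(R_tilde[0] - R_tilde[1])
print(d_chi2, d_euclid)       # the two distances agree
```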

Reason for $\chi^2$

Invariance

Suppose rows $i$ and $i'$ are proportional

$n_{ij}/n_{i'j} = \alpha$ for all $j = 1, \dots, J$

Suppose also that we pool these rows

New $n_{i^*j} = n_{ij} + n_{i'j}$

and delete originals

Then

New $\chi^2$ distance between cols $j$ and $j'$ equals old dist

Principle of distributional equivalence

Common profile, summed mass

Role of $\chi^2$ in statistical significance is not considered important in this literature
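The invariance can be checked on a toy table. In this sketch the table is made up so that its first two rows are proportional (alpha = 2), and `col_chi2_dists` is a hypothetical helper written for this example; pooling the two rows should leave every chi-square distance between column profiles unchanged.

```python
import numpy as np

def col_chi2_dists(N):
    """J x J matrix of chi-square distances between column profiles of N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    C = (P / c[None, :]).T / np.sqrt(r)     # column profiles, rescaled by 1/sqrt(r_i)
    diff = C[:, None, :] - C[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Made-up counts: the second row is proportional to the first (alpha = 2).
N = np.array([[12, 8, 8],
              [6, 4, 4],
              [8, 8, 10],
              [6, 9, 8]], dtype=float)

# Pool the two proportional rows and delete the originals.
pooled = np.vstack([N[0] + N[1], N[2:]])

print(np.allclose(col_chi2_dists(N), col_chi2_dists(pooled)))
```

The same check with unweighted Euclidean distances between column profiles would generally fail, which is one way to see why the chi-square weighting is used.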

Dimension reduction

Now we have a plot

With rows and cols both in $\mathbb{R}^{J-1}$

If $J$ is too big

reduce dimension

by principal components of

$$
\frac{r_{ij} - c_j}{\sqrt{c_j}}
$$

plot in reduced dimension

along with images of corners
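One way to sketch this reduction is with an SVD. The slide does not say how rows are weighted in the principal components; the code below assumes the usual choice of weighting each row by its mass, and it treats the "corners" as the J unit vectors (profiles concentrated on a single column). The table is the J = 3 example from these slides, so two dimensions already capture everything.

```python
import numpy as np

N = np.array([[12, 8, 8],
              [7, 7, 6],
              [8, 8, 10],
              [6, 9, 8],
              [9, 8, 7]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
R = P / r[:, None]
Z = (R - c) / np.sqrt(c)              # entries (r_ij - c_j) / sqrt(c_j)

# Mass-weighted principal components (assumed weighting): SVD of diag(sqrt(r)) Z.
U, s, Vt = np.linalg.svd(np.sqrt(r)[:, None] * Z, full_matrices=False)

k = 2
rows_2d = Z @ Vt[:k].T                # row profiles in the reduced plot

# Images of the corners: unit vectors e_j pushed through the same map.
corners = (np.eye(len(c)) - c) / np.sqrt(c)
corners_2d = corners @ Vt[:k].T

print(rows_2d)
print(corners_2d)
```

With larger J the choice of k trades plot readability against the share of inertia retained, which is the sum of the first k squared singular values over the total.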

Duality

Rows lie in $\min(I - 1, J - 1)$ dimensional space

So do columns

In PC of row profiles . . . columns are outside

In PC of column profiles . . . rows are outside

Symmetric correspondence analysis overlaps the points after rescaling

More notation

$D_r = \mathrm{diag}(\mathbf{r}) = \mathrm{diag}(r_1, \dots, r_I)$

$D_c = \mathrm{diag}(\mathbf{c}) = \mathrm{diag}(c_1, \dots, c_J)$

Symmetric analysis

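Uses SVD $S = U\Sigma V'$ where

$$
s_{ij} = \frac{p_{ij} - r_i c_j}{\sqrt{r_i c_j}}
$$

Total inertia is $\|S\|_F^2$

'principal inertias' are $\lambda_j^2$, the squared singular values on the diagonal of $\Sigma$

Coordinates

Rows (1st $k$ cols of)

$$
F = (D_r^{-1} P - \mathbf{1}\mathbf{c}')\, D_c^{-1/2} V = D_r^{-1/2} U \Sigma
$$

Columns (1st $k$ cols of)

$$
G = (D_c^{-1} P' - \mathbf{1}\mathbf{r}')\, D_r^{-1/2} U = D_c^{-1/2} V \Sigma'
$$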
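A minimal NumPy sketch of the symmetric analysis follows. It uses the ordinary SVD of S, whose U and V have orthonormal columns; under that convention the principal coordinates scale by inverse square roots of the masses, so that the mass-weighted second moments of the row coordinates are the principal inertias. The table is the J = 3 example from these slides and the variable names are illustrative.

```python
import numpy as np

N = np.array([[12, 8, 8],
              [7, 7, 6],
              [8, 8, 10],
              [6, 9, 8],
              [9, 8, 7]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Standardized residuals s_ij = (p_ij - r_i c_j) / sqrt(r_i c_j).
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

total_inertia = (S ** 2).sum()        # squared Frobenius norm of S
principal_inertias = s ** 2           # squared singular values

k = 2
F = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]    # row coordinates
G = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]    # column coordinates

print(total_inertia, principal_inertias.sum())
print(F)
print(G)
```

Plotting the rows of `F` and `G` on common axes gives the symmetric map; the signs of the axes are arbitrary because singular vectors are only defined up to sign.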

Symmetric analysis

Interpretation is tricky/controversial

$\mathbf{r}_i$ near $\mathbf{r}_{i'}$ ✓

$\mathbf{c}_j$ near $\mathbf{c}_{j'}$ ✓

$\mathbf{r}_i$ near $\mathbf{c}_j$ ??

Rows and columns are not in the same space

Biplots

Due to Gabriel (1971) Biometrika

For matrix $X_{ij}$

plot rows as $u_i \in \mathbb{R}^2$

cols as $v_j \in \mathbb{R}^2$

with $u_i' v_j \doteq X_{ij}$

A biplot interpretation applies to asymmetric plots
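A biplot of a generic matrix can be sketched with a rank-2 SVD. The data below are random numbers made up for illustration, and putting all of the singular values on the row markers is just one of several possible scalings, chosen here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # made-up data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
row_pts = U[:, :2] * s[:2]           # u_i in R^2 (singular values placed on the rows)
col_pts = Vt[:2].T                   # v_j in R^2

# Inner products u_i' v_j reproduce the best rank-2 approximation of X.
X_hat = row_pts @ col_pts.T
print(np.abs(X - X_hat).max())
```

How well the inner products match X depends on how much of the squared Frobenius norm the discarded singular values carry.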

Some finer points

Ghost points

Apply projection to point not in table

E.g. a hypothetical row entity:
1. impute a president's 'senate voting record'
2. compare a state's economy to those of countries

Treat as fixed profile with mass $\downarrow 0$

Merged points

Add linear combination or sum of rows, e.g.:
1. pool columns for math and statistics into "math sciences"
2. pool rows for EU countries into an EU point
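Both ideas amount to pushing an extra profile through a projection that has already been fitted. The sketch below assumes coordinates from the ordinary SVD of the standardized residuals; `project_row` is a hypothetical helper, the ghost row's counts are invented, and the table is the J = 3 example from these slides.

```python
import numpy as np

N = np.array([[12, 8, 8],
              [7, 7, 6],
              [8, 8, 10],
              [6, 9, 8],
              [9, 8, 7]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
V = Vt[:2].T                         # axes fitted from the table only

def project_row(counts):
    """Plot coordinates of a row of counts, whether or not it is in the table."""
    profile = counts / counts.sum()
    return ((profile - c) / np.sqrt(c)) @ V

F = np.vstack([project_row(row) for row in N])   # the table's own rows
ghost = project_row(np.array([5.0, 1.0, 1.0]))   # invented row, contributes no mass
merged = project_row(N[0] + N[1])                # first two rows pooled into one point

print(ghost)
print(merged)
```

The merged point lands at the mass-weighted average of the two rows it pools, since the projection is affine in the profile.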

Data types

Counts are straightforward

Other 'near measures' are reasonable
- rainfalls, heights, volumes, temperatures Kelvin
- dollars spent
- parts per million

Reweight cols to equalize inertia $\approx$ standardizing to equalize variance

Requires iteration

Fuzzy coding

$x \in \mathbb{R}$ becomes two columns

$(1, 0)$ for small $x$, say $x < L$

$(0, 1)$ for large $x$, say $x > U$

$(1 - t, t)$ for intermediate $x$, with $t = (x - L)/(U - L)$

Generalizations to > 2 columns
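The two-column fuzzy coding can be written as a clipped linear ramp. This is a minimal sketch: the function name `fuzzy_code` and the cutoffs used in the example call are made up for illustration.

```python
import numpy as np

def fuzzy_code(x, L, U):
    """Two-column fuzzy coding of x with cutoffs L < U."""
    # t is 0 below L, 1 above U, and linear in between.
    t = np.clip((np.asarray(x, dtype=float) - L) / (U - L), 0.0, 1.0)
    return np.column_stack([1.0 - t, t])

coded = fuzzy_code([0.0, 2.0, 3.5, 5.0, 9.0], L=2.0, U=5.0)
print(coded)
```

Each coded row sums to one, so the pair of columns behaves like a small profile and can sit alongside count data in the table.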

Puzzlers

Does it scale? (e.g. $10^8$ points in the plane)

Is there a tensor version? (Beyond all pairs of two way versions)

Distributional equivalence vs Poisson models

Further reading

"Correspondence Analysis in Practice", M. J. Greenacre, 1993. Emphasizes geometry with examples

"Theory and Applications of Correspondence Analysis", M. J. Greenacre, 1984. Good coverage of theory with examples

"Correspondence Analysis and Data Coding with Java and R", F. Murtagh, 2005. Code and worked examples
