
Part III: Dimensionality reduction

Hotelling’s principal component analysis (PCA) to generalized PCA for non-Gaussian data

Hotelling, H. (1933), Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24(6), 417-441.

Pearson, K. (1901), On Lines and Planes of Closest Fit to Systems of Points in Space, Philosophical Magazine 2(11), 559-572.


Hotelling’s Test Data

• Hotelling analyzed the correlations among numerous tests found in a sample of 140 seventh-grade children (from T. L. Kelley’s study).

Displayed below are the correlations among (1) reading speed, (2) reading power, (3) arithmetic speed and (4) arithmetic power:

$$R = \begin{pmatrix} 1 & .698 & .264 & .081 \\ & 1 & -.061 & .092 \\ & & 1 & .594 \\ & & & 1 \end{pmatrix}$$

• Given correlated variables $X_j$, does there exist some more fundamental set of independent variables, perhaps fewer in number than the original $X_j$’s, which determine the values of the $X_j$’s?


Principal Component Analysis (PCA)

PCA is concerned with explaining the variance-covariance structure of a set of correlated variables through a few linear combinations of these variables.

[Figure: scatterplots of mineral content of the dominant humerus (D.Humerus) versus the humerus, with a randomly chosen projection direction in one panel and the first principal component direction (PC1) in another.]

Figure: Data on the mineral content measurements (g/cm) of three bones (humerus, radius and ulna) on the dominant and nondominant sides for 25 older women


Variance Maximization

• Given $p$ correlated variables $X = (X_1, \cdots, X_p)^\top$, consider a linear combination of the $X_j$’s,
$$\sum_{j=1}^{p} a_j X_j = a^\top X$$
for $a = (a_1, \ldots, a_p)^\top \in \mathbb{R}^p$ with $\|a\|_2 = 1$.

• The first principal component direction is defined as the vector $a$ that gives the largest sample variance of $a^\top X$ amongst all normalized linear combinations of the $X_j$’s:
$$\max_{a \in \mathbb{R}^p,\ \|a\|_2 = 1} a^\top S_n a$$
where $S_n$ is the sample variance-covariance matrix of $X$.


Principal Components

• Let $S_n = \sum_{j=1}^{p} \lambda_j v_j v_j^\top$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0$, and the corresponding eigenvectors $v_1, \ldots, v_p$. Then the first principal component direction is given by $v_1$.

• The derived variable $Z_1 = v_1^\top X$ is called the first principal component.

• Similarly, the second principal component direction is defined as the vector $a$ that gives the largest sample variance of $a^\top X$ among all normalized $a$ subject to $a^\top X$ being uncorrelated with $v_1^\top X$. It is given by $v_2$.

• In general, the $j$th principal component direction is defined successively in this way from $j = 1$ to $p$; a code sketch follows.
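As a concrete illustration, here is a minimal numpy sketch of the eigendecomposition route above; the data matrix, its dimensions, and the seed are placeholders rather than anything from the slides.

```python
# PCA via eigendecomposition of the sample covariance matrix S_n.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))            # placeholder data: n = 25, p = 3

S_n = np.cov(X, rowvar=False)           # sample variance-covariance matrix
eigvals, eigvecs = np.linalg.eigh(S_n)  # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]       # reorder so lambda_1 >= ... >= lambda_p
lam = eigvals[order]
V = eigvecs[:, order]                   # columns are v_1, ..., v_p

Z = (X - X.mean(axis=0)) @ V            # principal component scores
print(lam)                              # sample variance of each component
print(Z[:, 0])                          # first principal component Z_1 = v_1' X
```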


Pearson’s Reconstruction Error Formulation

Pearson, K. (1901), On Lines and Planes of Closest Fit to Systems of Points in Space

• Given $x_1, \cdots, x_n \in \mathbb{R}^p$, consider the following data approximation:
$$x_i \approx \mu + v v^\top (x_i - \mu)$$
where $\mu \in \mathbb{R}^p$ and $v$ is a unit vector in $\mathbb{R}^p$, so that $v v^\top$ is a rank-one projection.

• What are the $\mu$ and $v \in \mathbb{R}^p$ that minimize the reconstruction error?
$$\min_{v^\top v = 1} \sum_{i=1}^{n} \left\| x_i - \mu - v v^\top (x_i - \mu) \right\|^2$$

• $\hat{\mu} = \bar{x}$ and $\hat{v} = v_1$ minimize the error.


Minimization of Reconstruction Error

• More generally, consider a rank-$k$ ($k < p$) approximation:
$$x_i \approx \mu + V V^\top (x_i - \mu)$$
where $\mu \in \mathbb{R}^p$ and $V$ is a $p \times k$ matrix with orthonormal columns, so that $V V^\top$ is a rank-$k$ projection.

• We wish to minimize the reconstruction error:
$$\min \sum_{i=1}^{n} \left\| x_i - \mu - V V^\top (x_i - \mu) \right\|^2 \quad \text{subject to } V^\top V = I_k$$

• $\hat{\mu} = \bar{x}$ and $\hat{V} = [v_1, \cdots, v_k]$ provide the best reconstruction of the data, as in the sketch below.
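A minimal sketch of the rank-$k$ reconstruction on placeholder data; the helper name pca_reconstruct is mine, not from the slides.

```python
# Best rank-k reconstruction x_i ~ mu + V V'(x_i - mu) via the top-k eigenvectors.
import numpy as np

def pca_reconstruct(X, k):
    """Return the rank-k PCA reconstruction of the rows of X."""
    mu = X.mean(axis=0)                              # mu_hat = x_bar
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # p x k, orthonormal columns
    return mu + (X - mu) @ V @ V.T                   # mu + V V'(x_i - mu), row-wise

rng = np.random.default_rng(1)
X = rng.normal(size=(25, 6))
X_hat = pca_reconstruct(X, k=2)
print(np.sum((X - X_hat) ** 2))                      # reconstruction error
```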


PCA for Non-Gaussian Data?

• PCA finds a low-rank subspace by implicitly minimizing the reconstruction error under squared error loss, which is linked to the Gaussian distribution.

• Binary, count, or non-negative data abound in practice, e.g., images, term frequencies for documents, ratings for movies, click-through rates for online ads.

• How to generalize PCA to non-Gaussian data?


Generalized PCA

Collins et al. (2001), A generalization of principal components analysis to the exponential family

• Draws on the ideas from the exponential family and generalized linear models.

• For Gaussian data, assume that $x_i \sim N_p(\theta_i, I_p)$ and $\theta_i \in \mathbb{R}^p$ lies in a $k$-dimensional subspace: for a basis $\{b_\ell\}_{\ell=1}^{k}$,
$$\theta_i = \sum_{\ell=1}^{k} a_{i\ell} b_\ell = B_{(p \times k)} a_i$$

• To find $\Theta = [\theta_{ij}]$, maximize the log likelihood or, equivalently, minimize the negative log likelihood (or deviance):
$$\min \sum_{i=1}^{n} \|x_i - \theta_i\|^2 = \|X - \Theta\|_F^2 = \|X - A B^\top\|_F^2$$


Generalized PCA

• According to the Eckart–Young theorem, the best rank-$k$ approximation of $X$ ($= U_{n \times p} D_{p \times p} V_{p \times p}^\top$) is given by the rank-$k$ truncated singular value decomposition
$$\underbrace{U_k D_k}_{A} \underbrace{V_k^\top}_{B^\top}.$$

• For exponential family data, factorize the matrix of natural parameter values $\Theta$ as $A B^\top$, with rank-$k$ matrices $A_{n \times k}$ and $B_{p \times k}$ (of orthonormal columns), by maximizing the log likelihood.

• For binary data $X = [x_{ij}]$ with $P = [p_{ij}]$, “logistic PCA” looks for a factorization of $\Theta = \left[ \log \frac{p_{ij}}{1 - p_{ij}} \right] = A B^\top$ that maximizes
$$\ell(X; \Theta) = \sum_{i,j} \left[ x_{ij} (a_i^\top b_j) - \log(1 + \exp(a_i^\top b_j)) \right]$$
subject to $B^\top B = I_k$ (a sketch follows this list).
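The following sketch contrasts the two cases on toy data: the closed-form rank-k truncated SVD for the Gaussian case, and the Bernoulli log-likelihood objective that the factorization above maximizes numerically over A and B. Shapes, seeds, and the helper name bernoulli_loglik are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian case (Eckart-Young): best rank-k approximation of X is U_k D_k V_k'.
X = rng.normal(size=(25, 6))
k = 2
U, d, Vt = np.linalg.svd(X, full_matrices=False)
A = U[:, :k] * d[:k]                              # A = U_k D_k
B = Vt[:k].T                                      # B = V_k
print(np.linalg.norm(X - A @ B.T, "fro"))         # Frobenius-norm error

# Binary case: the objective l(X; Theta) with Theta = A B', to be maximized.
def bernoulli_loglik(X_bin, Theta):
    # sum_ij [ x_ij * theta_ij - log(1 + exp(theta_ij)) ]
    return np.sum(X_bin * Theta - np.logaddexp(0.0, Theta))

X_bin = (rng.uniform(size=(25, 6)) < 0.3).astype(float)
A0 = rng.normal(size=(25, k))                     # case-specific factors
B0 = np.linalg.qr(rng.normal(size=(6, k)))[0]     # orthonormal columns
print(bernoulli_loglik(X_bin, A0 @ B0.T))
```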


Drawbacks of the Matrix Factorization Formulation

• Involves estimation of both case-specific (or row-specific) factors $A$ and variable-specific (or column-specific) factors $B$: more an extension of SVD than of PCA.

• The number of parameters increases with the number of observations.

• The scores of generalized PC for new data involve additional optimization, while PC scores for standard PCA are simple linear combinations of the data.


Alternative Interpretation of Standard PCA

• Assuming that the data are centered ($\mu = 0$),
$$\min \sum_{i=1}^{n} \left\| x_i - V V^\top x_i \right\|^2 = \|X - X V V^\top\|_F^2 \quad \text{subject to } V^\top V = I_k$$

• $X V V^\top$ can be viewed as a rank-$k$ projection of the matrix of natural parameters (“means” in this case) of the saturated model $\tilde{\Theta}$ for Gaussian data.

• Standard PCA finds the best rank-$k$ projection of $\tilde{\Theta}$ by minimizing the deviance under the Gaussian distribution.


New Formulation of Logistic PCA

Landgraf and Lee (2015), Dimensionality Reduction for Binary Data through the Projection of Natural Parameters

• Given $x_{ij} \sim \mathrm{Bernoulli}(p_{ij})$, the natural parameter ($\mathrm{logit}\, p_{ij}$) of the saturated model is
$$\tilde{\theta}_{ij} = \mathrm{logit}(x_{ij}) = \infty \times (2 x_{ij} - 1).$$
We will approximate $\tilde{\theta}_{ij} \approx m \times (2 x_{ij} - 1)$ for large $m > 0$.

• Project $\tilde{\Theta}$ onto a $k$-dimensional subspace by using the deviance $D(X; \Theta) = -2\,\ell(X; \Theta)$ as a loss:
$$\min_{V} D(X; \underbrace{\tilde{\Theta} V V^\top}_{\hat{\Theta}}) = -2 \sum_{i,j} \left[ x_{ij} \hat{\theta}_{ij} - \log(1 + \exp(\hat{\theta}_{ij})) \right] \quad \text{subject to } V^\top V = I_k$$
A sketch of this formulation follows.
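Below is a minimal sketch of this formulation on toy binary data. The projected-gradient update with a QR retraction is an illustrative choice of optimizer, not the algorithm of Landgraf and Lee (who use a different, majorization-based scheme); m, the step size, and all data are assumptions.

```python
import numpy as np

def deviance(X, Theta_hat):
    # D(X; Theta) = -2 sum_ij [ x_ij * theta_ij - log(1 + exp(theta_ij)) ]
    return -2.0 * np.sum(X * Theta_hat - np.logaddexp(0.0, Theta_hat))

def logistic_pca(X, k, m=5.0, step=1e-4, iters=500, seed=0):
    """Minimize D(X; Theta_tilde V V') over V with V'V = I_k (illustrative)."""
    Theta_t = m * (2.0 * X - 1.0)                 # approximate saturated logits
    rng = np.random.default_rng(seed)
    V = np.linalg.qr(rng.normal(size=(X.shape[1], k)))[0]
    for _ in range(iters):
        Theta_hat = Theta_t @ V @ V.T
        G = -2.0 * (X - 1.0 / (1.0 + np.exp(-Theta_hat)))  # dD/dTheta_hat
        grad_V = Theta_t.T @ G @ V + G.T @ Theta_t @ V     # chain rule in V
        V = np.linalg.qr(V - step * grad_V)[0]    # retract to orthonormal columns
    return V

rng = np.random.default_rng(4)
X = (rng.uniform(size=(50, 12)) < 0.3).astype(float)
m = 5.0
V = logistic_pca(X, k=2, m=m)
Theta_tilde = m * (2.0 * X - 1.0)
print(deviance(X, Theta_tilde @ V @ V.T))         # deviance of the projection
```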


Logistic PCA vs Logistic SVD

• The previous logistic SVD gives an approximation of $\mathrm{logit}\, P$:
$$\hat{\Theta}_{\mathrm{LSVD}} = A B^\top$$

• Alternatively, logistic PCA gives
$$\hat{\Theta}_{\mathrm{LPCA}} = \underbrace{\tilde{\Theta} V}_{A} V^\top,$$
which has far fewer parameters.

• Computing PC scores on new data only requires linear combinations of $\tilde{\theta}(x)$ for logistic PCA, while logistic SVD requires fitting a $k$-dimensional logistic regression for each new observation (see the snippet after this list).

• Logistic SVD, with the additional $A$, is prone to overfitting.
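As a small follow-up to the earlier sketch, scoring new binary observations under logistic PCA is just a linear map of their (approximate) saturated natural parameters; the names below are assumptions carried over from that sketch.

```python
import numpy as np

def lpca_scores(x_new, V, m=5.0):
    """PC scores for new binary rows: linear in theta_tilde(x), no refitting."""
    theta_tilde = m * (2.0 * np.asarray(x_new, dtype=float) - 1.0)
    return theta_tilde @ V
```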


New Formulation of Generalized PCA

• The idea can be applied to any exponential family distribution.

• Find the best rank-$k$ projection of the matrix of natural parameters from the saturated model $\tilde{\Theta}_X$ by minimizing the appropriate deviance for the data:
$$\min_{V} D(X; \tilde{\Theta}_X V V^\top) \quad \text{subject to } V^\top V = I_k$$

• If desired, main effects $\mu$ can be added to the approximation of $\Theta$:
$$\hat{\Theta} = \mathbf{1}\mu^\top + (\tilde{\Theta} - \mathbf{1}\mu^\top) V V^\top$$
(a one-function sketch follows).
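A one-function sketch of the main-effects adjustment; fixing mu at the column means of Theta_tilde is an illustrative assumption (it could instead be optimized jointly with V).

```python
import numpy as np

def project_with_main_effects(Theta_tilde, V):
    """Theta_hat = 1 mu' + (Theta_tilde - 1 mu') V V', per the formula above."""
    mu = Theta_tilde.mean(axis=0)                # illustrative choice of mu
    return mu + (Theta_tilde - mu) @ V @ V.T     # broadcasting supplies 1 mu'
```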


Medical Diagnosis Data

• Part of an electronic health record data set on 12,000 adult patients admitted to the intensive care units (ICU) at the Ohio State University Medical Center from 2007 to 2010 (provided by S. Hyun)

• Patients are classified as having one or more diseases out of over 800 disease categories from the International Classification of Diseases (ICD-9).

• Interested in characterizing the co-morbidity as latent factors, which can be used to define patient profiles for prediction of other clinical outcomes

• The analysis is based on a sample of 1,000 patients, which reduced the number of disease categories to 584.


Patient-Diagnosis Matrix

[Figure: binary patient-by-diagnosis matrix; courtesy of A. Landgraf]


Deviance Explained by Components

[Figure: cumulative (up to ~75%) and marginal (up to ~7.5%) percentage of deviance explained versus the number of principal components (0–30), comparing LPCA, LSVD, and PCA; courtesy of A. Landgraf]


Deviance Explained by Parameters

[Figure: percentage of deviance explained (0–75%) versus the number of free parameters (0k–40k), comparing LPCA, LSVD, and PCA; courtesy of A. Landgraf]


Deviance Predicted

[Figure: cumulative (up to ~40%) and marginal (up to ~6%) percentage of deviance predicted versus the number of principal components (0–30), comparing LPCA and PCA; courtesy of A. Landgraf]


Interpretation of Loadings

[Figure: loadings (roughly −0.2 to 0.2) of the first two components across the diagnosis categories. Labeled diagnoses in the Component 1 panel: 01:Gram−neg septicemia NEC; 03:Hyperpotassemia; 08:Acute respiratry failure; 10:Acute kidney failure NOS; 16:Speech Disturbance NEC; 17:Asphyxiation/strangulat. In the Component 2 panel: 03:DMII wo cmp nt st uncntr; 03:Hyperlipidemia NEC/NOS; 04:Leukocytosis NOS; 07:Mal hy kid w cr kid I−IV; 07:Old myocardial infarct; 07:Crnry athrscl natve vssl; 07:Systolic hrt failure NOS; 09:Oral soft tissue dis NEC; 10:Chr kidney dis stage III; V:Gastrostomy status.]

Figure: The first component is characterized by common serious conditions that bring patients to the ICU, and the second component is dominated by diseases of the circulatory system (07’s). Courtesy of A. Landgraf.


Acknowledgements

• Grace Wahba and Yi Lin (smoothing splines, SVM and regularization)

• Tao Shi (multivariate analysis)

• Rui Wang (comparison of classifiers)

• Andrew Landgraf (logistic PCA)

• Sookyung Hyun and Cheryl Newton (diagnosis data)

• Rogelio Ramos and Johan Van Horebeek @ CIMAT


This short course is largely based on the following references:

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Verlag, New York, 2001.

R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Pearson Prentice Hall, 6th edition, 2007.

Andrew J. Landgraf and Yoonkyung Lee. Dimensionality reduction for binary data through the projection of natural parameters. Technical Report 890, Department of Statistics, The Ohio State University, 2015.