Detailed Look at PCA

Three Important (& Interesting) Viewpoints:
1. Mathematics
2. Numerics
3. Statistics
1st: Review Linear Alg. and Multivar. Prob.

Different Views of PCA

Solves several optimization problems:
- Direction to maximize SS of 1-d proj'd data
- Direction to minimize SS of residuals (same, by Pythagorean Theorem)
- Best fit line to data in orthogonal sense (vs. regression of Y on X = vertical sense, & regression of X on Y = horizontal sense)
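
As a hedged illustration of the first optimization view (the symbols X_i, X-bar, and u are generic notation, not from the slides), the first PC direction can be written as
\[
  u_1 \;=\; \operatorname*{arg\,max}_{\|u\|=1} \sum_{i=1}^{n} \bigl( u^{\mathsf T} (X_i - \bar{X}) \bigr)^{2} ,
\]
and, by the Pythagorean Theorem, the same direction minimizes the SS of residuals
\[
  \sum_{i=1}^{n} \bigl\| (X_i - \bar{X}) - u\,u^{\mathsf T} (X_i - \bar{X}) \bigr\|^{2} .
\]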

Different Views of PCA

(Figure: projected residuals)

Different Views of PCA

(Figure: vertical residuals, X predicts Y)

PCA Data Represent'n (Cont.)

Now using the Spectral Representation (Raw Data):
- Entries of the eigenvectors are the Loadings
- Entries of the projections are the Scores
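
A hedged sketch of the spectral representation in generic notation (these symbols are assumptions, not recovered from the transcript): with data objects X_1, ..., X_n, eigenvectors u_1, ..., u_d of the sample covariance, and scores c_ij,
\[
  X_i - \bar{X} \;=\; \sum_{j=1}^{d} c_{ij}\, u_j ,
  \qquad
  c_{ij} \;=\; u_j^{\mathsf T} \bigl( X_i - \bar{X} \bigr) ,
\]
so the loadings are the entries of the u_j and the scores are the c_ij.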

PCA Data Represent'n (Cont.)

Reduced Rank Representation: reconstruct using only the leading terms (assuming decreasing eigenvalues).
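
In the same assumed notation, the reduced rank representation keeps only the first q < d terms of the sum above (eigenvalues assumed decreasing):
\[
  X_i - \bar{X} \;\approx\; \sum_{j=1}^{q} c_{ij}\, u_j .
\]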

PCA Data Represent'n (Cont.)

SCREE Plot Drawbacks:
- What is a Knee?
- What if There are Several?
- Knees Depend on Scaling (Power? log?)

Personal Suggestions:
- Find Auxiliary Cutoffs (Inter-Rater Variation)
- Use the Full Range (a la Scale Space)

Alternate PCA Computation

Another Variation: Dual PCA

Idea: Useful to view the data matrix with columns as data objects.

Alternate PCA Computation

Another Variation: Dual PCA

Idea: Useful to view the data matrix as both:
- Columns as Data Objects, &
- Rows as Data Objects

Alternate PCA Computation

Dual PCA Computation: same as above, but replace the data matrix by its transpose. So can almost replace the (primal) covariance matrix with its dual version. Then use the SVD to get the dual eigen-decomposition.

Note: Same Eigenvalues
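
A hedged sketch of the dual relationship, in generic notation (X here denotes the centered d x n data matrix; the symbols are assumptions, not from the slides): with the SVD X = U D V^T,
\[
  \tfrac{1}{n} X X^{\mathsf T} \;=\; U\, \tfrac{D^{2}}{n}\, U^{\mathsf T}
  \qquad \text{and} \qquad
  \tfrac{1}{n} X^{\mathsf T} X \;=\; V\, \tfrac{D^{2}}{n}\, V^{\mathsf T} ,
\]
so the primal (columns as objects) and dual (rows as objects) covariances share the same nonzero eigenvalues, D^2 / n.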

Return to Big Picture

Main statistical goals of OODA:
- Understanding population structure: low dim'al projections, PCA
- Classification (i.e. Discrimination): understanding 2+ populations
- Time Series of Data Objects: chemical spectra, mortality data
- Vertical Integration of Data Types

Classification - Discrimination

Background: Two Class (Binary) version: using training data from Class +1 and Class -1, develop a rule for assigning new data to a Class.

Canonical Example: Disease Diagnosis
- New patients are Healthy or Ill
- Determine based on measurements

Classification - Discrimination

Important Distinction: Classification vs. Clustering

Useful terminology:
- Classification: supervised learning
- Clustering: unsupervised learning

Classification Basics

For Simple Toy Example: project on the MD (mean difference) direction & split at the center.

(Figure: HDLSStmbMdif.ps)

Classification Basics

PCA for slanted clouds:
- PC1 terrible
- PC2 better?
- Still misses right dir'n
- Doesn't use class labels

(Figure: HDLSSod1PCA.ps)

Classification Basics

Mean Difference for slanted clouds:
- A little better?
- Still misses right dir'n
- Want to account for covariance

(Figure: HDLSSod1Mdif.ps)

Classification Basics

Better Solution: Fisher Linear Discrimination
- Gets the right dir'n
- How does it work?

(Figure: HDLSSod1FLD.ps)

Fisher Linear Discrimination

Other common terminology (for FLD): Linear Discriminant Analysis (LDA)

Original Paper: Fisher (1936)

Fisher Linear Discrimination

Careful development. Useful notation (data vectors of length d):
- Class +1 data vectors
- Class -1 data vectors

Centerpoints: the Class +1 and Class -1 sample means
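
A hedged notational sketch (symbols assumed, not recovered from the slides): write the Class +1 sample as X_1^{(+1)}, ..., X_{n_{+1}}^{(+1)} and the Class -1 sample as X_1^{(-1)}, ..., X_{n_{-1}}^{(-1)}, each a vector of length d, with centerpoints
\[
  \bar{X}^{(+1)} = \frac{1}{n_{+1}} \sum_{i=1}^{n_{+1}} X_i^{(+1)} ,
  \qquad
  \bar{X}^{(-1)} = \frac{1}{n_{-1}} \sum_{i=1}^{n_{-1}} X_i^{(-1)} .
\]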

Fisher Linear Discrimination

Covariances, for each class (outer products), based on centered, normalized data matrices.

Note: use the MLE version of estimated covariance matrices, for simpler notation.
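
In that assumed notation, the MLE (divide-by-n_j) class covariance estimates are the averaged outer products of the class-centered data:
\[
  \hat{\Sigma}^{(j)}
  \;=\;
  \frac{1}{n_j} \sum_{i=1}^{n_j}
  \bigl( X_i^{(j)} - \bar{X}^{(j)} \bigr)
  \bigl( X_i^{(j)} - \bar{X}^{(j)} \bigr)^{\mathsf T} ,
  \qquad j \in \{+1, -1\} .
\]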

Fisher Linear Discrimination

Major Assumption: Class covariances are the same (or similar). Like this: Not this:

(Figures: HDLSSod1Raw.ps & HDLSSxd1Raw.ps)

Fisher Linear Discrimination

Good estimate of (common) within class cov? Pooled (weighted average) within class cov, based on the combined full data matrix.

Note: Different Means
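
A sketch of the pooled (weighted average) within-class covariance in the same assumed notation:
\[
  \hat{\Sigma}_w
  \;=\;
  \frac{ n_{+1}\, \hat{\Sigma}^{(+1)} \;+\; n_{-1}\, \hat{\Sigma}^{(-1)} }{ n_{+1} + n_{-1} } ,
\]
i.e. the MLE covariance of the combined data matrix after centering each class at its own mean (the "Different Means" note).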

Fisher Linear Discrimination

Note: the pooled within-class covariance is similar to the overall covariance from before, i.e. the covariance matrix ignoring class labels.
Important Difference: Class by Class Centering. Will be important later.

(Figure: HDLSSod1Raw.ps)

Fisher Linear Discrimination

Simple way to find the "correct cov. adjustment": individually transform the subpopulations so they are "spherical" about their means.

For each class, define the transformed (sphered) data. Note: this "spheres" the data, in the sense that the transformed within-class covariance is the identity (a sketch follows below).

(Figure: HDLSSod1egFLD.ps)
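
A hedged sketch of the sphering transform (symbols assumed): using a root inverse of the pooled within-class covariance,
\[
  Y_i^{(j)} \;=\; \hat{\Sigma}_w^{-1/2}\, X_i^{(j)} ,
  \qquad j \in \{+1, -1\} ,
\]
so that the transformed within-class covariance is the identity,
\[
  \hat{\Sigma}_w^{-1/2}\, \hat{\Sigma}_w\, \hat{\Sigma}_w^{-1/2} \;=\; I_d .
\]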

Fisher Linear Discrimination

Then: in the transformed space, the best separating hyperplane is the perpendicular bisector of the line between the means.

Fisher Linear Discrimination

In the transformed space, the separating hyperplane has:
- Transformed Normal Vector: the difference of the transformed class means
- Transformed Intercept: the midpoint of the transformed class means
- Sep. Hyperp. has Equation: (sketched below)
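
A hedged sketch of the transformed hyperplane (symbols assumed): with transformed normal vector and intercept
\[
  \tilde{w} \;=\; \hat{\Sigma}_w^{-1/2} \bigl( \bar{X}^{(+1)} - \bar{X}^{(-1)} \bigr) ,
  \qquad
  \tilde{b} \;=\; \hat{\Sigma}_w^{-1/2}\, \frac{ \bar{X}^{(+1)} + \bar{X}^{(-1)} }{2} ,
\]
the separating hyperplane in the sphered space is
\[
  \bigl\{\, y : \tilde{w}^{\mathsf T} ( y - \tilde{b} ) = 0 \,\bigr\} .
\]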

Fisher Linear Discrimination

Thus the discrimination rule is: given a new data vector, choose Class +1 when its sphered version is closer to the sphered Class +1 centerpoint, i.e. (transforming back to the original space, using the fact that the covariance root-inverse is symmetric and invertible) when the linear inequality sketched below holds.

Fisher Linear Discrimination

So (in the orig'l space) we have a separating hyperplane, with a normal vector and an intercept given in the sketch below.

(Figure: HDLSSod1egFLD.ps)
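
A hedged sketch of the resulting FLD rule in the original space (standard form; symbols assumed): with normal vector and intercept
\[
  w \;=\; \hat{\Sigma}_w^{-1} \bigl( \bar{X}^{(+1)} - \bar{X}^{(-1)} \bigr) ,
  \qquad
  b \;=\; \frac{ \bar{X}^{(+1)} + \bar{X}^{(-1)} }{2} ,
\]
a new data vector x is assigned to Class +1 when
\[
  w^{\mathsf T} ( x - b ) \;>\; 0 .
\]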

Fisher Linear Discrimination

Relationship to Mahalanobis distance. Idea: for data with a common covariance, a natural distance measure is the Mahalanobis distance:
- Unit free, i.e. standardized
- Essentially mods out the covariance structure
- Euclidean distance applied to the sphered data; same as the key transformation for FLD
- I.e. FLD is the mean difference in Mahalanobis space
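
The Mahalanobis distance referred to here, in generic notation for a covariance matrix Sigma:
\[
  d_{\Sigma}(x, y)
  \;=\;
  \sqrt{ (x - y)^{\mathsf T}\, \Sigma^{-1}\, (x - y) }
  \;=\;
  \bigl\| \Sigma^{-1/2} x - \Sigma^{-1/2} y \bigr\| ,
\]
i.e. the Euclidean distance between the sphered versions of x and y.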

Classical Discrimination

Above derivation of FLD was:
- Nonstandard
- Not in any textbooks(?)
- Nonparametric (don't need Gaussian data), i.e. used no probability distributions
- More Machine Learning than Statistics

Classical Discrimination

FLD Likelihood View

Classical Discrimination

FLD Likelihood View

Assume: class distributions are multivariate Gaussian, with class-specific means and a common covariance, i.e. a strong distributional assumption + common covariance.
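
A sketch of the assumed model (standard notation, symbols not from the transcript):
\[
  X^{(j)} \;\sim\; N_d\!\bigl( \mu^{(j)},\, \Sigma \bigr) ,
  \qquad j \in \{+1, -1\} ,
\]
with different class means but a common covariance matrix.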

Classical Discrimination

FLD Likelihood View (cont.)

At a location x, the likelihood ratio, for choosing between Class +1 and Class -1, is the ratio of the two class densities, where each is the Gaussian density with the common covariance (see the sketch below).
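
A hedged sketch of that likelihood ratio (phi_Sigma denotes the mean-zero Gaussian density with covariance Sigma; symbols assumed):
\[
  LR(x)
  \;=\;
  \frac{ \phi_{\Sigma}\bigl( x - \mu^{(+1)} \bigr) }
       { \phi_{\Sigma}\bigl( x - \mu^{(-1)} \bigr) } ,
  \qquad
  \phi_{\Sigma}(u)
  \;=\;
  \frac{1}{ (2\pi)^{d/2} |\Sigma|^{1/2} }
  \exp\!\Bigl( -\tfrac{1}{2}\, u^{\mathsf T} \Sigma^{-1} u \Bigr) .
\]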

Classical Discrimination

FLD Likelihood View (cont.)

Simplifying, using the Gaussian density, gives (critically using the common covariances) a comparison of the two exponents, i.e. of the squared Mahalanobis distances from x to the two class means.

Classical Discrimination

FLD Likelihood View (cont.)

Expanding these quadratic forms, the same terms subtract off and the cross terms have cancellation, so the likelihood ratio exceeds one exactly when a linear inequality in x holds (a hedged sketch of this derivation follows below).
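
A hedged reconstruction of the cancellation argument (standard derivation, details assumed): LR(x) > 1 exactly when
\[
  \bigl( x - \mu^{(-1)} \bigr)^{\mathsf T} \Sigma^{-1} \bigl( x - \mu^{(-1)} \bigr)
  \;>\;
  \bigl( x - \mu^{(+1)} \bigr)^{\mathsf T} \Sigma^{-1} \bigl( x - \mu^{(+1)} \bigr) ,
\]
and, because the covariance is common, the quadratic terms x^T Sigma^{-1} x subtract off, leaving the linear rule
\[
  \bigl( \mu^{(+1)} - \mu^{(-1)} \bigr)^{\mathsf T} \Sigma^{-1}
  \Bigl( x \;-\; \tfrac{ \mu^{(+1)} + \mu^{(-1)} }{2} \Bigr)
  \;>\; 0 .
\]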

Classical Discrimination

FLD Likelihood View (cont.)

Replacing the class means and the common covariance by their maximum likelihood estimates (the class sample means and the pooled within-class covariance) gives the likelihood ratio discrimination rule: choose Class +1 when the estimated linear inequality above holds.

Same as above, so: FLD can be viewed as a Likelihood Ratio Rule.

Classical Discrimination

FLD Generalization I: Gaussian Likelihood Ratio Discrimination (a.k.a. nonlinear discriminant analysis)
- Idea: assume the class distributions are Gaussian, now with different covariances!
- The Likelihood Ratio rule is a straightf'd num'l calc. (thus can easily implement, and do discrim'n)

Classical Discrimination

Gaussian Likelihood Ratio Discrim'n (cont.)
- No longer have a separat'g hyperplane repr'n (instead regions determined by quadratics; a sketch follows below)
- (Fairly complicated case-wise calculations)
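
A hedged sketch of why the regions are quadratic (generic notation, with class covariances Sigma^{(+1)} and Sigma^{(-1)}): the log likelihood ratio
\[
  \log LR(x)
  \;=\;
  \tfrac{1}{2} ( x - \mu^{(-1)} )^{\mathsf T} \bigl( \Sigma^{(-1)} \bigr)^{-1} ( x - \mu^{(-1)} )
  \;-\;
  \tfrac{1}{2} ( x - \mu^{(+1)} )^{\mathsf T} \bigl( \Sigma^{(+1)} \bigr)^{-1} ( x - \mu^{(+1)} )
  \;+\;
  \tfrac{1}{2} \log \frac{ |\Sigma^{(-1)}| }{ |\Sigma^{(+1)}| }
\]
is quadratic in x when the covariances differ, so the decision boundary is a quadratic surface rather than a hyperplane.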

Graphical display: for each point, color as:
- Yellow if assigned to Class +1
- Cyan if assigned to Class -1
- (Intensity is strength of assignment)

Classical Discrimination

FLD for Tilted Point Clouds - works well
(Figure: PEod1FLDe1.ps)

Classical Discrimination

GLR for Tilted Point Clouds - works well
(Figure: PEod1GLRe1.ps)

Classical Discrimination

FLD for Donut - poor, no plane can work
(Figure: PEdonFLDe1.ps)

Classical Discrimination

GLR for Donut - works well (good quadratic), even though the data are not Gaussian
(Figure: PEdonGLRe1.ps)

Classical Discrimination

FLD for X - poor, no plane can work
(Figure: PExd3FLDe1.ps)

Classical Discrimination

GLR for X - better, but not great
(Figure: PExd3GLRe1.ps)

Classical Discrimination

Summary of FLD vs. GLR:
- Tilted Point Clouds Data: FLD good, GLR good
- Donut Data: FLD bad, GLR good
- X Data: FLD bad, GLR OK, not great
Classical Conclusion: GLR generally better (will see a different answer for HDLSS data)

Classical Discrimination

FLD Generalization II (Gen. I was GLR): different prior probabilities
- Main idea: give different weights to the 2 classes, i.e. assume not a priori equally likely
- Development is straightforward: modified likelihood, change intercept in FLD
- Won't explore further here

Classical Discrimination

FLD Generalization III: Principal Discriminant Analysis
- Idea: FLD-like approach to > 2 classes
- Assumption: class covariance matrices are the same (similar) (but not Gaussian, same situation as for FLD)
- Main idea: quantify location of classes by their means

Classical Discrimination

Principal Discriminant Analysis (cont.)

Simple way to find interesting directions among the means: PCA on the set of means (think analog of Mean Difference), i.e. eigen-analysis of the between class covariance matrix, built from the class means.

Aside: can show a decomposition of the overall covariance (sketched below).
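
A hedged sketch of the between-class covariance and the decomposition alluded to in the aside (reconstructed in generic notation for k classes with proportions pi_j, class means X-bar^(j), and overall mean X-bar):
\[
  \hat{\Sigma}_B
  \;=\;
  \sum_{j=1}^{k} \pi_j
  \bigl( \bar{X}^{(j)} - \bar{X} \bigr)
  \bigl( \bar{X}^{(j)} - \bar{X} \bigr)^{\mathsf T} ,
  \qquad
  \hat{\Sigma}_{\text{overall}} \;=\; \hat{\Sigma}_w \;+\; \hat{\Sigma}_B .
\]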

Classical Discrimination

Principal Discriminant Analysis (cont.)

But PCA only works like Mean Difference; expect we can improve by taking covariance into account (recall the improvement of FLD over MD).

Blind application of the above ideas suggests an eigen-analysis combining the within-class and between-class covariances (a sketch follows below).
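
A hedged sketch of that eigen-analysis (generic notation):
\[
  \hat{\Sigma}_w^{-1}\, \hat{\Sigma}_B\, v \;=\; \lambda\, v ,
  \qquad \text{equivalently} \qquad
  \hat{\Sigma}_B\, v \;=\; \lambda\, \hat{\Sigma}_w\, v ,
\]
a generalized eigenvalue problem whose leading eigenvectors give the discriminant directions.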

Classical Discrimination

Principal Discriminant Analysis (cont.)

There are:
- Smarter ways to compute (generalized eigenvalue)
- Other representations (this solves optimization probs)
- Special case: 2 classes, reduces to standard FLD
Good reference for more: Section 3.8 of Duda, Hart & Stork (2001)

Classical Discrimination

Summary of Classical Ideas:
- Among Simple Methods: MD and FLD sometimes similar, sometimes FLD better, so FLD is preferred
- Among Complicated Methods: GLR is best, so always use that?
- Caution: story changes for HDLSS settings

HDLSS Discrimination

Main HDLSS issues:
- Sample Size, n < Dimension, d
- Singular covariance matrix, so can't use matrix inverse
- I.e. can't standardize (sphere) the data (requires root inverse covariance)
- Can't do classical multivariate analysis

HDLSS Discrimination

Main HDLSS issues: can't do classical multivariate analysis.

Key Idea: Standardize
- Subtract Sample Mean
- Sphere the Data
- And use the N(0,I) dist'n for statistical inference

HDLSS Discrimination

An approach to non-invertible covariances:
- Replace inverses by generalized inverses (sometimes called pseudo inverses)
- Note: there are several
- Here use the Moore-Penrose inverse, as used by Matlab (pinv.m)
- Often provides useful results (but not always)

HDLSS Discrimination

Application of Generalized Inverse to FLD:

- Direction (Normal) Vector: as for FLD, but with the generalized inverse of the pooled within-class covariance
- Intercept: as for FLD
- Have replaced the matrix inverse by the Moore-Penrose generalized inverse (a hedged sketch follows below)
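
A minimal Python/NumPy sketch of this pseudo-inverse FLD (not the author's code; the function name fld_pinv and the toy data are invented for illustration; numpy.linalg.pinv plays the role of Matlab's pinv.m):

import numpy as np

def fld_pinv(X_plus, X_minus):
    """FLD direction and intercept, with a Moore-Penrose pseudo-inverse.

    X_plus, X_minus: (n_j, d) arrays whose rows are the data vectors of each
    class. Returns (w, b) so a new vector x is assigned to Class +1 when
    w @ (x - b) > 0.
    """
    mean_p, mean_m = X_plus.mean(axis=0), X_minus.mean(axis=0)

    # MLE (divide-by-n_j) class covariances, pooled with weights n_j
    cov_p = np.cov(X_plus, rowvar=False, bias=True)
    cov_m = np.cov(X_minus, rowvar=False, bias=True)
    n_p, n_m = len(X_plus), len(X_minus)
    sigma_w = (n_p * cov_p + n_m * cov_m) / (n_p + n_m)

    # Moore-Penrose pseudo-inverse replaces the (possibly singular) inverse
    w = np.linalg.pinv(sigma_w) @ (mean_p - mean_m)
    b = (mean_p + mean_m) / 2.0
    return w, b

# Toy HDLSS-style usage (d = 50 > n_j = 10), with made-up Gaussian data
rng = np.random.default_rng(0)
X_plus = rng.normal(loc=+1.0, size=(10, 50))
X_minus = rng.normal(loc=-1.0, size=(10, 50))
w, b = fld_pinv(X_plus, X_minus)
print("first Class +1 point assigned to Class +1?", w @ (X_plus[0] - b) > 0)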

HDLSS Discrimination

Toy Example: Increasing Dimension
- Data vectors: Entry 1 has different distributions for Class +1 and Class -1; the other entries are noise
- All Entries Independent
- Look through increasing dimensions

HDLSS Discrimination

Increasing Dimension Example
- Proj. on Opt'l Dir'n
- Proj. on FLD Dir'n
- Proj. on both Dir'ns

HDLSS Discrimination

Add a 2nd Dimension (noise)
- Same Proj. on Opt'l Dir'n
- Axes same as dir'ns
- Now see 2 dim'ns

HDLSS Discrimination

Add a 3rd Dimension (noise): project on the 2-d subspace generated by the optimal dir'n & by the FLD dir'n

HDLSS Discrimination

Movie Through Increasing Dimensions
