Detailed Look at PCA
Three Important (& Interesting) Viewpoints:
- Mathematics
- Numerics
- Statistics
1st: Review Linear Alg. and Multivar. Prob.

Different Views of PCA
Solves several optimization problems:
- Direction to maximize SS of 1-d proj'd data
- Direction to minimize SS of residuals (same, by Pythagorean Theorem)
- Best fit line to data in orthogonal sense (vs. regression of Y on X = vertical sense, & regression of X on Y = horizontal sense)
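The first two optimization views can be checked numerically. A minimal numpy sketch (the toy data, seed, and random-direction sampling are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 points in 2-d, stretched along an oblique direction
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)               # center the data first

# PC1 = leading eigenvector of the sample covariance matrix
evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
pc1, pc2 = evecs[:, -1], evecs[:, 0]  # eigh sorts eigenvalues ascending

def ss(u):
    """Sum of squares of the 1-d data projected onto unit vector u."""
    return float(np.sum((Xc @ u) ** 2))

# PC1 maximizes projected SS over many random unit directions ...
dirs = rng.normal(size=(500, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
assert ss(pc1) + 1e-9 >= max(ss(u) for u in dirs)

# ... and, by the Pythagorean Theorem, maximizing projected SS is the
# same as minimizing residual SS: the two pieces add up to the total SS
assert np.isclose(ss(pc1) + ss(pc2), np.sum(Xc ** 2))
```

The second assertion is exactly the "same, by Pythagorean Theorem" point: total SS splits into projected SS plus residual SS, so maximizing one minimizes the other.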
[Figure: Projected Residuals]

[Figure: Vertical Residuals (X predicts Y)]

PCA Data Represent'n (Cont.)
Now Using:
Spectral Representation (Raw Data):

$$X = \sum_{j=1}^{d} u_j c_j^{t}, \qquad c_j = X^{t} u_j$$

Where: Entries of $u_j$ are Loadings; Entries of $c_j$ are Scores
PCA Data Represent'n (Cont.)
Reduced Rank Representation:

$$X^{(k)} = \sum_{j=1}^{k} u_j c_j^{t}$$

Reconstruct Using Only $k \le d$ Terms (Assuming Decreasing Eigenvalues)
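The reduced rank representation can be sketched with a truncated SVD; the residual SS equals the sum of the discarded squared singular values (Eckart-Young). A numpy sketch with assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 50))            # d = 8 variables, n = 50 data objects

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Reduced rank representation: keep only the first k terms u_j s_j v_j^t
k = 3
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Residual sum of squares = sum of the discarded squared singular values
resid_ss = np.sum((X - Xk) ** 2)
assert np.isclose(resid_ss, np.sum(s[k:] ** 2))
```

This is why "assuming decreasing eigenvalues" matters: keeping the leading terms discards the smallest possible residual.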
PCA Data Represent'n (Cont.)
SCREE Plot Drawbacks:
- What is a Knee?
- What if There are Several?
- Knees Depend on Scaling (Power? log?)
Personal Suggestions:
- Find Auxiliary Cutoffs (Inter-Rater Variation)
- Use the Full Range (à la Scale Space)

Alternate PCA Computation
Another Variation: Dual PCA
Idea: Useful to View Data Matrix as Both:
- Col'ns as Data Objects, &
- Rows as Data Objects
Dual PCA Computation: Same as above, but replace $X$ with $X^{t}$.
So can almost replace $X X^{t}$ with $X^{t} X$.
Then use SVD, $X = U S V^{t}$, to get:

$$X X^{t} = U S^{2} U^{t}, \qquad X^{t} X = V S^{2} V^{t}$$
Note: Same Eigenvalues
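The "same eigenvalues" note can be verified directly: the nonzero eigenvalues of the $d \times d$ outer product $X X^{t}$ and of the $n \times n$ inner product $X^{t} X$ coincide. A numpy sketch with assumed toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 12))              # d = 5 rows, n = 12 columns

outer_evals = np.linalg.eigvalsh(X @ X.T)  # 5 eigenvalues
inner_evals = np.linalg.eigvalsh(X.T @ X)  # 12 eigenvalues

# The 5 nonzero eigenvalues coincide; the larger matrix pads with zeros
assert np.allclose(np.sort(outer_evals)[::-1],
                   np.sort(inner_evals)[::-1][:5])
```

This is what makes dual PCA attractive when one of $d$, $n$ is much smaller: work with whichever Gram matrix is smaller, and recover the same spectrum.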
Return to Big Picture
Main statistical goals of OODA:
- Understanding population structure: Low dim'al Projections, PCA
- Classification (i.e. Discrimination): Understanding 2+ populations
- Time Series of Data Objects: Chemical Spectra, Mortality Data
- Vertical Integration of Data Types
Classification - Discrimination
Background: Two Class (Binary) version: Using training data from Class +1 and Class -1, develop a rule for assigning new data to a Class.
Canonical Example: Disease Diagnosis. New Patients are Healthy or Ill; determine which, based on measurements.

Important Distinction: Classification vs. Clustering
Useful terminology:
- Classification: supervised learning
- Clustering: unsupervised learning

Classification Basics
For Simple Toy Example:
Project on MD, & split at center.

PCA for slanted clouds: PC1 terrible; PC2 better? Still misses right dir'n; Doesn't use Class Labels.

Mean Difference for slanted clouds: A little better? Still misses right dir'n; Want to account for covariance.
Better Solution: Fisher Linear Discrimination
Gets the right dir'n. How does it work?
Fisher Linear Discrimination
Other common terminology (for FLD): Linear Discriminant Analysis (LDA)
Original Paper: Fisher (1936)

Careful development. Useful notation (data vectors of length $d$):
Class +1: $X_1^{(+1)}, \ldots, X_{n_{+1}}^{(+1)}$; Class -1: $X_1^{(-1)}, \ldots, X_{n_{-1}}^{(-1)}$

Centerpoints: $\bar{X}^{(+1)} = \frac{1}{n_{+1}} \sum_{i=1}^{n_{+1}} X_i^{(+1)}$ and $\bar{X}^{(-1)} = \frac{1}{n_{-1}} \sum_{i=1}^{n_{-1}} X_i^{(-1)}$
Covariances, for $k = \pm 1$ (outer products), based on centered, normalized data matrices:

$$\hat{\Sigma}^{(k)} = \frac{1}{n_k} \tilde{X}^{(k)} \left( \tilde{X}^{(k)} \right)^{t}, \qquad \tilde{X}^{(k)} = X^{(k)} - \bar{X}^{(k)} \mathbf{1}^{t}$$

Note: use MLE version of estimated covariance matrices (divide by $n_k$, not $n_k - 1$), for simpler notation.
Major Assumption: Class covariances are the same (or similar). Like this: [figure] Not this: [figure]
Good estimate of (common) within class cov? Pooled (weighted average) within class cov:

$$\hat{\Sigma}_w = \frac{n_{+1} \hat{\Sigma}^{(+1)} + n_{-1} \hat{\Sigma}^{(-1)}}{n_{+1} + n_{-1}}$$

based on the combined full data matrix. Note: Different Means (each class is centered at its own mean).
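The pooled within-class covariance can be sketched in numpy (the two toy classes are assumed for illustration). Note the equivalence with the MLE covariance of the combined, class-wise centered data matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two classes in d = 2 with a shared covariance, different means
Xp = rng.normal(size=(40, 2)) + np.array([2.0, 0.0])   # Class +1
Xm = rng.normal(size=(60, 2)) - np.array([2.0, 0.0])   # Class -1

def mle_cov(X):
    """MLE covariance: center class-by-class, divide by n (not n - 1)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / len(X)

n_p, n_m = len(Xp), len(Xm)
# Pooled (weighted average) within-class covariance
sigma_w = (n_p * mle_cov(Xp) + n_m * mle_cov(Xm)) / (n_p + n_m)

# Equivalent form: MLE covariance of the combined class-wise centered data
combined = np.vstack([Xp - Xp.mean(axis=0), Xm - Xm.mean(axis=0)])
assert np.allclose(sigma_w, combined.T @ combined / (n_p + n_m))
```

The class-by-class centering ("different means") is exactly what distinguishes $\hat{\Sigma}_w$ from the covariance that ignores class labels.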
Note: $\hat{\Sigma}_w$ is similar to $\hat{\Sigma}$ from before, i.e. the covariance matrix ignoring class labels.
Important Difference: Class by Class Centering. Will be important later.
Simple way to find correct cov. adjustment: Individually transform subpopulations so they are spherical about their means.

For $k = \pm 1$ define:

$$Y_i^{(k)} = \hat{\Sigma}_w^{-1/2} X_i^{(k)}$$

Note: This spheres the data, in the sense that the transformed within-class covariance is the identity:

$$\hat{\Sigma}_w^{-1/2} \, \hat{\Sigma}_w \, \hat{\Sigma}_w^{-1/2} = I$$
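The sphering step can be checked numerically. A numpy sketch (toy classes and the symmetric root-inverse construction via the eigendecomposition are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[2.0, 1.0], [0.5, 1.5]])        # shared covariance factor
Xp = rng.normal(size=(200, 2)) @ A + 3.0       # Class +1
Xm = rng.normal(size=(300, 2)) @ A - 3.0       # Class -1

def mle_cov(X):
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / len(X)

n_p, n_m = len(Xp), len(Xm)
sigma_w = (n_p * mle_cov(Xp) + n_m * mle_cov(Xm)) / (n_p + n_m)

# Symmetric root-inverse via the eigendecomposition of sigma_w
evals, evecs = np.linalg.eigh(sigma_w)
root_inv = evecs @ np.diag(evals ** -0.5) @ evecs.T

# Transforming by root_inv "spheres" the data: the pooled within-class
# covariance of the transformed data becomes the identity
Yp, Ym = Xp @ root_inv, Xm @ root_inv
sigma_w_t = (n_p * mle_cov(Yp) + n_m * mle_cov(Ym)) / (n_p + n_m)
assert np.allclose(sigma_w_t, np.eye(2))
```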
Then: In Transformed Space, the best separating hyperplane is the perpendicular bisector of the line between the means.
In Transformed Space, Separating Hyperplane has:

Transformed Normal Vector:
$$\tilde{n} = \hat{\Sigma}_w^{-1/2} \bar{X}^{(+1)} - \hat{\Sigma}_w^{-1/2} \bar{X}^{(-1)}$$

Transformed Intercept:
$$\tilde{\mu} = \tfrac{1}{2} \left( \hat{\Sigma}_w^{-1/2} \bar{X}^{(+1)} + \hat{\Sigma}_w^{-1/2} \bar{X}^{(-1)} \right)$$

Sep. Hyperp. has Equation:
$$\left\{ x : \tilde{n}^{t} (x - \tilde{\mu}) = 0 \right\}$$
Thus discrimination rule is: Given a new data vector $X_0$, Choose Class +1 when:

$$\tilde{n}^{t} \left( \hat{\Sigma}_w^{-1/2} X_0 - \tilde{\mu} \right) > 0$$

i.e. (transforming back to original space), using, for symmetric and invertible $A$:

$$(A x)^{t} (A y) = x^{t} A^{2} y$$

Choose Class +1 when:

$$n^{t} (X_0 - \mu) > 0$$

where:
$$n = \hat{\Sigma}_w^{-1} \left( \bar{X}^{(+1)} - \bar{X}^{(-1)} \right), \qquad \mu = \tfrac{1}{2} \left( \bar{X}^{(+1)} + \bar{X}^{(-1)} \right)$$
So (in orig'l space) have separating hyperplane with:
Normal vector: $n = \hat{\Sigma}_w^{-1} \left( \bar{X}^{(+1)} - \bar{X}^{(-1)} \right)$
Intercept: $\mu = \tfrac{1}{2} \left( \bar{X}^{(+1)} + \bar{X}^{(-1)} \right)$
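The original-space rule above fits in a few lines of numpy. The slanted toy clouds below are an assumption, chosen so the mean-difference direction alone would miss the right dir'n:

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[2.0, 1.0], [0.0, 0.5]])            # shared, tilted covariance factor
Xp = rng.normal(size=(100, 2)) @ A + [4.0, 2.0]    # Class +1 sample
Xm = rng.normal(size=(100, 2)) @ A - [4.0, 2.0]    # Class -1 sample

def mle_cov(X):
    """MLE covariance: center class-by-class, divide by n."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / len(X)

sigma_w = (len(Xp) * mle_cov(Xp) + len(Xm) * mle_cov(Xm)) / (len(Xp) + len(Xm))
mean_p, mean_m = Xp.mean(axis=0), Xm.mean(axis=0)

# FLD separating hyperplane in the original space
normal = np.linalg.solve(sigma_w, mean_p - mean_m)   # Sigma_w^{-1} (mean diff)
intercept = (mean_p + mean_m) / 2                    # midpoint of class means

def classify(x):
    """Choose Class +1 when x lies on the +1 side of the hyperplane."""
    return 1 if (x - intercept) @ normal > 0 else -1

train_acc = (sum(classify(x) == 1 for x in Xp) +
             sum(classify(x) == -1 for x in Xm)) / 200
assert train_acc > 0.9
```

Note the design choice: solving $\hat{\Sigma}_w n = \bar{X}^{(+1)} - \bar{X}^{(-1)}$ rather than explicitly inverting the matrix, which is the numerically preferred form.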
Relationship to Mahalanobis distance. Idea: For $x$ from a distribution with mean $\mu$ and covariance $\Sigma$, a natural distance measure is:

$$d(x, \mu) = \left( (x - \mu)^{t} \Sigma^{-1} (x - \mu) \right)^{1/2}$$

- unit free, i.e. standardized
- essentially mods out covariance structure
- Euclidean dist. applied to $\Sigma^{-1/2} x$ & $\Sigma^{-1/2} \mu$
- Same as key transformation for FLD

I.e. FLD is the mean difference in Mahalanobis space.
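The two properties in the bullets can be checked numerically; a numpy sketch (the covariance, point, and rescaling below are illustrative assumptions):

```python
import numpy as np

sigma = np.array([[4.0, 2.0], [2.0, 2.0]])   # a positive definite covariance
mu = np.array([1.0, -1.0])
x = np.array([3.0, 0.5])

# Mahalanobis distance: Euclidean distance after "modding out" covariance
mahal = np.sqrt((x - mu) @ np.linalg.solve(sigma, x - mu))

# Same thing via the symmetric root-inverse transformation used for FLD:
# Euclidean distance between Sigma^{-1/2} x and Sigma^{-1/2} mu
evals, evecs = np.linalg.eigh(sigma)
root_inv = evecs @ np.diag(evals ** -0.5) @ evecs.T
assert np.isclose(mahal, np.linalg.norm(root_inv @ (x - mu)))

# Unit free: rescaling coordinates (and the covariance to match) changes nothing
scale = np.diag([10.0, 0.1])
d2_scaled = (scale @ (x - mu)) @ np.linalg.solve(scale @ sigma @ scale,
                                                 scale @ (x - mu))
assert np.isclose(mahal, np.sqrt(d2_scaled))
```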
39Classical DiscriminationAbove derivation of FLD was:NonstandardNot in any textbooks(?)40Classical DiscriminationAbove derivation of FLD was:NonstandardNot in any textbooks(?)Nonparametric (dont need Gaussian data)I.e. Used no probability distributions41Classical DiscriminationAbove derivation of FLD was:NonstandardNot in any textbooks(?)Nonparametric (dont need Gaussian data)I.e. Used no probability distributionsMore Machine Learning than Statistics42Classical DiscriminationFLD Likelihood View
43Classical DiscriminationFLD Likelihood ViewAssume: Class distributions are multivariate for
44Classical DiscriminationFLD Likelihood ViewAssume: Class distributions are multivariate for
strong distributional assumption + common covariance
FLD Likelihood View (cont.)
At a location $x_0$, the likelihood ratio, for choosing between Class +1 and Class -1, is:

$$LR(x_0) = \frac{f_{\mu^{(+1)}, \Sigma}(x_0)}{f_{\mu^{(-1)}, \Sigma}(x_0)}$$

where $f_{\mu, \Sigma}$ is the Gaussian density with mean $\mu$ and covariance $\Sigma$.
Simplifying, using the Gaussian density:

$$f_{\mu, \Sigma}(x) = (2\pi)^{-d/2} \left| \Sigma \right|^{-1/2} e^{-\frac{1}{2} (x - \mu)^{t} \Sigma^{-1} (x - \mu)}$$

Gives (critically using common covariances, so the normalizing constants cancel):

$$LR(x_0) = \exp\left\{ -\tfrac{1}{2} (x_0 - \mu^{(+1)})^{t} \Sigma^{-1} (x_0 - \mu^{(+1)}) + \tfrac{1}{2} (x_0 - \mu^{(-1)})^{t} \Sigma^{-1} (x_0 - \mu^{(-1)}) \right\}$$
But:

$$(x_0 - \mu)^{t} \Sigma^{-1} (x_0 - \mu) = x_0^{t} \Sigma^{-1} x_0 - 2 x_0^{t} \Sigma^{-1} \mu + \mu^{t} \Sigma^{-1} \mu$$

for $\mu = \mu^{(\pm 1)}$, so:

$$LR(x_0) = \exp\left\{ x_0^{t} \Sigma^{-1} \left( \mu^{(+1)} - \mu^{(-1)} \right) - \tfrac{1}{2} \mu^{(+1)t} \Sigma^{-1} \mu^{(+1)} + \tfrac{1}{2} \mu^{(-1)t} \Sigma^{-1} \mu^{(-1)} \right\}$$

Note: the same quadratic terms $x_0^{t} \Sigma^{-1} x_0$ subtract off, and the cross terms have cancellation.

Thus $LR(x_0) > 1$ when

$$x_0^{t} \Sigma^{-1} \left( \mu^{(+1)} - \mu^{(-1)} \right) - \tfrac{1}{2} \left( \mu^{(+1)} + \mu^{(-1)} \right)^{t} \Sigma^{-1} \left( \mu^{(+1)} - \mu^{(-1)} \right) > 0$$

i.e.

$$\left( x_0 - \tfrac{\mu^{(+1)} + \mu^{(-1)}}{2} \right)^{t} \Sigma^{-1} \left( \mu^{(+1)} - \mu^{(-1)} \right) > 0$$
Replacing $\mu^{(+1)}$, $\mu^{(-1)}$ and $\Sigma$ by maximum likelihood estimates: $\bar{X}^{(+1)}$, $\bar{X}^{(-1)}$ and $\hat{\Sigma}_w$, gives the likelihood ratio discrimination rule: Choose Class +1 when $LR(x_0) > 1$.

Same as above, so: FLD can be viewed as the Likelihood Ratio Rule.
FLD Generalization I
Gaussian Likelihood Ratio Discrimination (a.k.a. nonlinear discriminant analysis)
Idea: Assume class distributions are $N\!\left( \mu^{(k)}, \Sigma^{(k)} \right)$. Different covariances!
Likelihood Ratio rule is a straightf'd num'l calc. (thus can easily implement, and do discrim'n).
Gaussian Likelihood Ratio Discrim'n (cont.)
No longer have a separating hyperplane repr'n (instead regions are determined by quadratics; fairly complicated case-wise calculations).
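The straightforward numerical calculation is just a comparison of two Gaussian log densities with class-wise means and covariances. A numpy sketch; the concentric toy classes are an assumption, chosen so the boundary is roughly a circle, which no hyperplane can match:

```python
import numpy as np

rng = np.random.default_rng(6)
# Donut-flavored toy data: a tight inner class and a spread-out outer class
Xp = rng.normal(size=(200, 2)) * 0.5      # Class +1: small covariance
Xm = rng.normal(size=(200, 2)) * 3.0      # Class -1: large covariance

def fit_gaussian(X):
    """MLE mean and covariance of one class."""
    mu = X.mean(axis=0)
    Xc = X - mu
    return mu, Xc.T @ Xc / len(X)

def log_density(x, mu, sigma):
    """Log of the multivariate Gaussian density (constants included)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

mu_p, sig_p = fit_gaussian(Xp)
mu_m, sig_m = fit_gaussian(Xm)

def glr_classify(x):
    """Choose Class +1 when the Gaussian likelihood ratio exceeds 1."""
    return 1 if log_density(x, mu_p, sig_p) > log_density(x, mu_m, sig_m) else -1

# Near the origin the tight class wins; far away the spread class wins
assert glr_classify(np.array([0.1, 0.1])) == 1
assert glr_classify(np.array([8.0, 8.0])) == -1
```

Because each class carries its own covariance, the decision boundary is quadratic in $x$, matching the "regions determined by quadratics" point above.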
Graphical display: for each point, color as: Yellow if assigned to Class +1; Cyan if assigned to Class -1 (intensity is strength of assignment).

FLD for Tilted Point Clouds: Works well
GLR for Tilted Point Clouds: Works well

FLD for Donut: Poor, no plane can work

GLR for Donut: Works well (good quadratic), even though data not Gaussian

FLD for X: Poor, no plane can work

GLR for X: Better, but not great
Summary of FLD vs. GLR:
- Tilted Point Clouds Data: FLD good; GLR good
- Donut Data: FLD bad; GLR good
- X Data: FLD bad; GLR OK, not great
Classical Conclusion: GLR generally better (will see a different answer for HDLSS data).

FLD Generalization II (Gen. I was GLR)
Different prior probabilities. Main idea: Give different weights to the 2 classes, i.e. assume they are not a priori equally likely. Development is straightforward: modified likelihood, change of intercept in FLD. Won't explore further here.

FLD Generalization III
Principal Discriminant Analysis
Idea: FLD-like approach to > 2 classes.
Assumption: Class covariance matrices are the same (similar) (but not Gaussian; same situation as for FLD).
Main idea: Quantify location of classes by their means.
Principal Discriminant Analysis (cont.)
Simple way to find interesting directions among the means: PCA on the set of means (think: analog of Mean Difference), i.e. eigen-analysis of the between class covariance matrix $\hat{\Sigma}_B$, where $\hat{\Sigma}_B$ is the (class-size weighted) covariance matrix of the class means $\bar{X}^{(k)}$.

Aside: can show: overall $\hat{\Sigma} = \hat{\Sigma}_B + \hat{\Sigma}_w$
Principal Discriminant Analysis (cont.)
But PCA only works like Mean Difference; expect can improve by taking covariance into account (recall the improvement of FLD over MD).
Blind application of the above ideas suggests eigen-analysis of:

$$\hat{\Sigma}_w^{-1} \hat{\Sigma}_B$$
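The eigen-analysis of $\hat{\Sigma}_w^{-1} \hat{\Sigma}_B$ can be sketched in numpy; with 2 classes the leading eigenvector should recover the FLD direction (up to sign and scale). The toy classes and equal class sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
A = np.array([[1.5, 1.0], [0.0, 0.5]])
Xp = rng.normal(size=(80, 2)) @ A + [2.0, 0.0]   # Class +1
Xm = rng.normal(size=(80, 2)) @ A - [2.0, 0.0]   # Class -1

def mle_cov(X):
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / len(X)

means = np.vstack([Xp.mean(axis=0), Xm.mean(axis=0)])
grand = means.mean(axis=0)                # equal class sizes here

# Between-class covariance: covariance of the class means
sigma_b = (means - grand).T @ (means - grand) / len(means)
sigma_w = (mle_cov(Xp) + mle_cov(Xm)) / 2

# "Blind" eigen-analysis of Sigma_w^{-1} Sigma_B
evals, evecs = np.linalg.eig(np.linalg.solve(sigma_w, sigma_b))
top = evecs[:, np.argmax(evals.real)].real

# For 2 classes the leading direction matches FLD (up to sign and scale)
fld = np.linalg.solve(sigma_w, means[0] - means[1])
cos = abs(top @ fld) / (np.linalg.norm(top) * np.linalg.norm(fld))
assert np.isclose(cos, 1.0)
```

This makes the "special case: 2 classes reduces to standard FLD" claim concrete: $\hat{\Sigma}_B$ is rank one, and its single informative direction, adjusted by $\hat{\Sigma}_w^{-1}$, is the FLD normal vector.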
Principal Discriminant Analysis (cont.)
There are:
- smarter ways to compute (generalized eigenvalue problem)
- other representations (this solves optimization prob's)
Special case: 2 classes, reduces to standard FLD.
Good reference for more: Section 3.8 of Duda, Hart & Stork (2001).

Summary of Classical Ideas:
- Among Simple Methods: MD and FLD sometimes similar; sometimes FLD better; so FLD is preferred.
- Among Complicated Methods: GLR is best; so always use that?
- Caution: Story changes for HDLSS settings.

HDLSS Discrimination
Main HDLSS issues:
- Sample Size, n < Dimension, d
- Singular covariance matrix, so can't use matrix inverse
- I.e. can't standardize (sphere) the data (requires root inverse covariance)
- Can't do classical multivariate analysis
Can't do classical multivariate analysis, whose Key Idea is: Standardize (Subtract Sample Mean, Sphere the Data) and use the $N(0, I)$ dist'n for statistical inference.
An approach to non-invertible covariances: Replace by generalized inverses (sometimes called pseudo inverses; note: there are several). Here use the Moore-Penrose inverse, as used by Matlab (pinv.m). Often provides useful results (but not always).

Application of Generalized Inverse to FLD:
Direction (Normal) Vector:
$$n = \hat{\Sigma}_w^{-} \left( \bar{X}^{(+1)} - \bar{X}^{(-1)} \right)$$

Intercept:
$$\mu = \tfrac{1}{2} \left( \bar{X}^{(+1)} + \bar{X}^{(-1)} \right)$$

Have replaced $\hat{\Sigma}_w^{-1}$ by the generalized inverse $\hat{\Sigma}_w^{-}$.
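A minimal numpy sketch of the generalized-inverse FLD in an HDLSS setting (the toy dimensions, mean shift, and noise structure are assumptions; `numpy.linalg.pinv` computes the same Moore-Penrose pseudo-inverse as Matlab's pinv.m):

```python
import numpy as np

rng = np.random.default_rng(8)
d, n = 50, 10                  # HDLSS: dimension d > sample size n
Xp = rng.normal(size=(n, d))   # Class +1: mean shift in entry 1
Xp[:, 0] += 2.0
Xm = rng.normal(size=(n, d))   # Class -1
Xm[:, 0] -= 2.0

def mle_cov(X):
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / len(X)

sigma_w = (mle_cov(Xp) + mle_cov(Xm)) / 2
# Built from only 2n - 2 < d centered vectors, so sigma_w is singular
assert np.linalg.matrix_rank(sigma_w) < d

# Replace the matrix inverse with the Moore-Penrose generalized inverse
normal = np.linalg.pinv(sigma_w) @ (Xp.mean(axis=0) - Xm.mean(axis=0))
intercept = (Xp.mean(axis=0) + Xm.mean(axis=0)) / 2

def classify(x):
    return 1 if (x - intercept) @ normal > 0 else -1

# The rule is at least computable; e.g. it places each class mean correctly
assert classify(Xp.mean(axis=0)) == 1
assert classify(Xm.mean(axis=0)) == -1
```

This shows the "often useful" side; the "but not always" caution is the point of the increasing-dimension example that follows.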
Toy Example: Increasing Dimension. Data vectors:
- Entry 1: Class +1: ... ; Class -1: ...
- Other Entries: ...
- All Entries Independent
Look through dimensions, ...
Increasing Dimension Example: Proj. on Opt'l Dir'n; Proj. on FLD Dir'n; Proj. on both Dir'ns

Add a 2nd Dimension (noise): Same Proj. on Opt'l Dir'n; Axes same as dir'ns; Now see 2 dim'ns

Add a 3rd Dimension (noise): Project on the 2-d subspace generated by the optimal dir'n & by the FLD dir'n

Movie Through Increasing Dimensions