![Page 1: ETC3250: Dimension reduction · ETC3250: Dimension reduction Semester 1, 2020 Professor Di Cook Econometrics and Business Statistics Monash University Week 4 (a) Space is big. You](https://reader034.vdocuments.us/reader034/viewer/2022042913/5f4bde3ec59a4c5f111d00c1/html5/thumbnails/1.jpg)
ETC3250: Dimension reduction
Semester 1, 2020
Professor Di Cook
Econometrics and Business Statistics Monash University
Week 4 (a)
Space is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space.
Douglas Adams, Hitchhiker's Guide to the Galaxy
High Dimensional Data
Remember, our data can be denoted as:
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}, \quad \text{where } x_i = (x_{i1}, \dots, x_{ip})^T$$
then the dimension of the data is $p$, the number of variables.
Cubes and Spheres
Space expands exponentially with dimension:
As dimension increases, the volume of a sphere of the same radius as the cube side length becomes much smaller than the volume of the cube.
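A minimal sketch of how quickly the sphere shrinks relative to the cube. It assumes the ball is inscribed in the cube (radius 1, side length 2), which is one common way to draw this comparison; the helper names `ball_volume` and `cube_volume` are hypothetical and not part of the course code.

```r
# Volume of a p-dimensional ball of radius r (standard closed-form formula)
ball_volume <- function(p, r = 1) pi^(p / 2) / gamma(p / 2 + 1) * r^p

# Volume of a p-dimensional cube with side length s
cube_volume <- function(p, s = 1) s^p

p <- c(1, 2, 3, 5, 10, 20)

# Assumption: ball inscribed in the cube, so radius 1 and side length 2
ratio <- ball_volume(p, r = 1) / cube_volume(p, s = 2)

data.frame(p = p, ratio = signif(ratio, 3))
```

In two dimensions the ratio is $\pi/4$, and it falls towards zero as $p$ grows, so almost all of the cube's volume ends up in its corners.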
Examples of High Dimensional Data
High dimensional data occurs commonly in bioinformatics, where genetic studies often have much more information on genes than on patients.
Example - SRBCT cancer prediction
The SRBCT dataset (Khan et al., 2001) looks at classifying 4 classes of different childhood tumours sharing similar visual features during routine histology.
Data contains 83 microarray samples with 1586 features.
We will revisit this data later on in the course to explore high dimensional Discriminant Analysis.
Source: Nature
Multivariate data
Mostly though, we're working on problems where $n \gg p$, and $p > 1$. This would more commonly be referred to as multivariate data.
Sub-spaces
Data will often be confined to a region of the space having lower intrinsic dimensionality. The data lives in a low-dimensional subspace.
Analyse the data by reducing dimensionality to the subspace containing the data.
Principal Component Analysis (PCA)
Principal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. It is an unsupervised learning method.
Why use PCA?
We may have too many predictors for a regression. Instead, we can use the first few principal components.
Understanding relationships between variables.
Data visualization. We can plot a small number of variables more easily than a large number of variables.
First principal component
The first principal component of a set of variables $x_1, x_2, \dots, x_p$ is the linear combination
$$z_1 = \phi_{11}x_1 + \phi_{21}x_2 + \dots + \phi_{p1}x_p$$
that has the largest variance such that
$$\sum_{j=1}^{p} \phi_{j1}^2 = 1$$
The elements $\phi_{11}, \dots, \phi_{p1}$ are the loadings of the first principal component.
Geometry
The loading vector $\phi_1 = [\phi_{11}, \dots, \phi_{p1}]'$ defines the direction in feature space along which the data vary most.
If we project the $n$ data points $x_1, \dots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \dots, z_{n1}$.
The second principal component is the linear combination $z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \dots + \phi_{p2}x_{ip}$ that has maximal variance among all linear combinations that are uncorrelated with $z_1$.
Equivalent to constraining $\phi_2$ to be orthogonal (perpendicular) to $\phi_1$. And so on.
There are at most $\min(n - 1, p)$ PCs.
Example
If you think of the first few PCs like a linear model fit, and the others as the error, it is like regression, except that the errors are orthogonal to the model.
Computation
PCA can be thought of as fitting a $p$-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The new variables produced by principal components correspond to rotating and scaling the ellipse into a circle.
Computation
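Suppose we have an $n \times p$ data set $X = [x_{ij}]$.
Centre each of the variables to have mean zero (i.e., the column means of $X$ are zero).
$$z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \dots + \phi_{p1}x_{ip}$$
Sample variance of $z_{i1}$ is $\frac{1}{n}\sum_{i=1}^{n} z_{i1}^2$.
$$\underset{\phi_{11},\dots,\phi_{p1}}{\text{maximize}} \; \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{p} \phi_{j1}x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$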
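The maximisation above can be checked numerically. This is a rough sketch, assuming the built-in `USArrests` data as a stand-in (it is not a dataset used in these slides) and a hypothetical helper `proj_var`: no random unit-length direction should give a larger projected variance than the first loading vector.

```r
set.seed(1)

# Assumption: USArrests (built into R) stands in for a generic n x p data set
X <- scale(USArrests, center = TRUE, scale = TRUE)
n <- nrow(X)

# (1/n) * sum_i z_i^2 for the projection z = X phi (hypothetical helper)
proj_var <- function(phi) sum((X %*% phi)^2) / n

# First loading vector as computed by prcomp (unit length by construction)
phi1 <- prcomp(X)$rotation[, 1]

# 1000 random directions, each normalised so that sum(phi^2) = 1
rand <- replicate(1000, {
  v <- rnorm(ncol(X))
  v / sqrt(sum(v^2))
})
best_random <- max(apply(rand, 2, proj_var))

c(first_pc = proj_var(phi1), best_random = best_random)
```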
Computation
1. Compute the covariance matrix (after scaling the columns of $X$): $C = X'X$
2. Find eigenvalues and eigenvectors: $C = VDV'$, where columns of $V$ are orthonormal (i.e., $V'V = I$)
3. Compute PCs: $\Phi = V$. $Z = X\Phi$.
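The three steps can be sketched in R as follows. This assumes the built-in `USArrests` data as a stand-in for $X$, and divides $X'X$ by $n - 1$ so that the eigenvalues are on the same scale as the variances reported by `prcomp`; the scaling factor changes the eigenvalues but not the eigenvectors.

```r
# Assumption: USArrests (built into R) is used purely as example data
X <- scale(USArrests, center = TRUE, scale = TRUE)  # centred, unit-variance columns
n <- nrow(X)

# Step 1: covariance matrix of the scaled data
C <- crossprod(X) / (n - 1)        # X'X / (n - 1)

# Step 2: eigen-decomposition C = V D V'
e <- eigen(C, symmetric = TRUE)

# Step 3: loadings and principal component scores
Phi <- e$vectors
Z <- X %*% Phi

# Compare with prcomp; columns can differ by a sign flip, so compare absolute values
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)
all.equal(sqrt(e$values), pc$sdev)
all.equal(abs(Phi), abs(unname(pc$rotation)))
```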
Singular Value Decomposition
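$$X = U\Lambda V'$$
$X$ is an $n \times p$ matrix.
$U$ is an $n \times r$ matrix with orthonormal columns ($U'U = I$).
$\Lambda$ is an $r \times r$ diagonal matrix with non-negative elements.
$V$ is a $p \times r$ matrix with orthonormal columns ($V'V = I$).
It is always possible to decompose a matrix in this way. The singular values are unique, and the singular vectors are unique up to sign when the singular values are distinct.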
Computation
1. Compute SVD: $X = U\Lambda V'$.
2. Compute PCs: $\Phi = V$. $Z = X\Phi$.
Relationship with covariance:
$$C = X'X = V\Lambda U'U\Lambda V' = V\Lambda^2 V' = VDV'$$
Eigenvalues of $C$ are squares of singular values of $X$.
Eigenvectors of $C$ are right singular vectors of $X$.
The PC directions $\phi_1, \phi_2, \phi_3, \dots$ are the right singular vectors of the matrix $X$.
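A short sketch of the SVD route, again assuming the built-in `USArrests` data as a stand-in for $X$. It checks the two relationships stated above: the squared singular values of $X$ equal the eigenvalues of $X'X$, and the scores $Z = X\Phi$ equal $U\Lambda$.

```r
# Assumption: USArrests (built into R) is used purely as example data
X <- scale(USArrests, center = TRUE, scale = TRUE)

# Step 1: singular value decomposition X = U Lambda V'
s <- svd(X)

# Step 2: loadings are the right singular vectors; scores are Z = X Phi
Phi <- s$v
Z <- X %*% Phi

# Eigenvalues of C = X'X are the squared singular values of X
ev <- eigen(crossprod(X), symmetric = TRUE)$values
all.equal(s$d^2, ev)

# Z = X V = U Lambda
all.equal(unname(Z), s$u %*% diag(s$d))
```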
Total variance
Total variance in data (assuming variables centered at 0):
$$TV = \sum_{j=1}^{p} \text{Var}(x_j) = \sum_{j=1}^{p} \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2$$
If variables are standardised, TV = number of variables!
Variance explained by m'th PC:
$$V_m = \text{Var}(z_m) = \frac{1}{n}\sum_{i=1}^{n} z_{im}^2$$
$$TV = \sum_{m=1}^{M} V_m \quad \text{where } M = \min(n - 1, p).$$
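One way to see why the PC variances add up to the total variance is a short trace argument. This is a sketch using the SVD notation $X = U\Lambda V'$ from the earlier slides, where $\lambda_m$ (introduced here for illustration) denotes the $m$'th diagonal element of $\Lambda$ and $X$ is assumed to be centred:

```latex
\begin{aligned}
TV &= \frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2
    = \frac{1}{n}\,\mathrm{tr}(X'X)
    = \frac{1}{n}\,\mathrm{tr}(V\Lambda^2 V')
    = \frac{1}{n}\,\mathrm{tr}(\Lambda^2 V'V)
    = \frac{1}{n}\sum_{m} \lambda_m^2, \\
V_m &= \frac{1}{n}\sum_{i=1}^{n} z_{im}^2
     = \frac{1}{n}\,(Z'Z)_{mm}
     = \frac{1}{n}\,(\Lambda U'U\Lambda)_{mm}
     = \frac{1}{n}\,\lambda_m^2 .
\end{aligned}
```

Summing the second line over $m$ gives the first, so $TV = \sum_m V_m$.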
How to choose $k$?
PCA is a useful dimension reduction technique for large datasets, but deciding how many dimensions to keep is often not clear. 🤔
Think: How do we know how many principal components to choose?
How to choose $k$?
Proportion of variance explained:
$$PVE_m = \frac{V_m}{TV}$$
Choosing the number of PCs that adequately summarises the variation in $X$ is achieved by examining the cumulative proportion of variance explained.
Cumulative proportion of variance explained:
$$CPVE_k = \sum_{m=1}^{k} \frac{V_m}{TV}$$
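Both quantities are easy to compute from the standard deviations returned by `prcomp`. The sketch below assumes the built-in `USArrests` data as a stand-in. R's variances use a divisor of $n - 1$ rather than $n$, which cancels in the ratios, so the proportions are unaffected.

```r
# Assumption: USArrests (built into R) is used purely as example data
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

V <- pc$sdev^2        # variance explained by each PC
TV <- sum(V)          # total variance; equals p for standardised variables
pve <- V / TV         # proportion of variance explained
cpve <- cumsum(pve)   # cumulative proportion of variance explained

round(rbind(Variance = V, Proportion = pve, `Cum. prop` = cpve), 2)
```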
How to choose $k$?
Scree plot: Plot of the variance explained by each component against the component number.
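A scree plot can be drawn with base R graphics. This is a minimal sketch, assuming the built-in `USArrests` data as a stand-in; the look for the "elbow" is the point where the curve flattens out.

```r
# Assumption: USArrests (built into R) is used purely as example data
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)
V <- pc$sdev^2   # variance explained by each component

# Variance explained against component number, points joined by lines
plot(seq_along(V), V, type = "b",
     xlab = "Component number", ylab = "Variance explained",
     main = "Scree plot")
```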
Example - track records
The data on national track records for women (as at 1984).
    ## Rows: 55
    ## Columns: 8
    ## $ m100     <dbl> 11.61, 11.20, 11.43, 11.41, 11.46, 11.31, 12.14, 11.00, 12.0…
    ## $ m200     <dbl> 22.94, 22.35, 23.09, 23.04, 23.05, 23.17, 24.47, 22.25, 24.5…
    ## $ m400     <dbl> 54.50, 51.08, 50.62, 52.00, 53.30, 52.80, 55.00, 50.06, 54.9…
    ## $ m800     <dbl> 2.15, 1.98, 1.99, 2.00, 2.16, 2.10, 2.18, 2.00, 2.05, 2.08, …
    ## $ m1500    <dbl> 4.43, 4.13, 4.22, 4.14, 4.58, 4.49, 4.45, 4.06, 4.23, 4.33, …
    ## $ m3000    <dbl> 9.79, 9.08, 9.34, 8.88, 9.81, 9.77, 9.51, 8.81, 9.37, 9.31, …
    ## $ marathon <dbl> 178.52, 152.37, 159.37, 157.85, 169.98, 168.75, 191.02, 149.…
    ## $ country  <chr> "argentin", "australi", "austria", "belgium", "bermuda", "br…
Source: Johnson and Wichern, Applied multivariate analysis
Explore the data
Compute PCA
    # PCA on the seven event times, centred and scaled to unit variance
    track_pca <- prcomp(track[, 1:7], center = TRUE, scale. = TRUE)
    track_pca
    ## Standard deviations (1, .., p=7):
    ## [1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15
    ##
    ## Rotation (n x k) = (7 x 7):
    ##           PC1   PC2    PC3    PC4    PC5     PC6    PC7
    ## m100     0.37  0.49 -0.286  0.319  0.231  0.6198  0.052
    ## m200     0.37  0.54 -0.230 -0.083  0.041 -0.7108 -0.109
    ## m400     0.38  0.25  0.515 -0.347 -0.572  0.1909  0.208
    ## m800     0.38 -0.16  0.585 -0.042  0.620 -0.0191 -0.315
    ## m1500    0.39 -0.36  0.013  0.430  0.030 -0.2312  0.693
    ## m3000    0.39 -0.35 -0.153  0.363 -0.463  0.0093 -0.598
    ## marathon 0.37 -0.37 -0.484 -0.672  0.131  0.1423  0.070
Assess
Summary of the principal components:
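|            | PC1  | PC2  | PC3  | PC4  | PC5  | PC6  | PC7  |
|------------|------|------|------|------|------|------|------|
| Variance   | 5.81 | 0.65 | 0.30 | 0.13 | 0.05 | 0.04 | 0.02 |
| Proportion | 0.83 | 0.09 | 0.04 | 0.02 | 0.01 | 0.01 | 0.00 |
| Cum. prop  | 0.83 | 0.92 | 0.97 | 0.98 | 0.99 | 1.00 | 1.00 |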
The increase in variance explained is large until $k = 3$ PCs, and then tapers off. A choice of 3 PCs would explain 97% of the total variance.
Assess
Scree plot: Where is the elbow?
At $k = 2$, so the scree plot suggests that 2 PCs would be sufficient to explain the variability.
Assess
Visualise model using a biplot: Plot the principal component scores, and also the contribution of the original variables to the principal component.
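A biplot can be produced directly from a `prcomp` object with base R's `biplot()`. This is a small sketch, assuming the built-in `USArrests` data as a stand-in; with `scale = 0` the points are the PC scores and the arrows are the loadings of the original variables.

```r
# Assumption: USArrests (built into R) is used purely as example data
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Scores (points) and variable loadings (arrows) on the first two PCs
biplot(pc, choices = 1:2, scale = 0, cex = 0.6)
```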
Significance of loadings
Bootstrap can be used to assess whether the coefficients of a PC are significantly different from 0. The 95% bootstrap confidence intervals can be computed by:
1. Generate B bootstrap samples of the data.
2. Compute the PCA on each sample and record the loadings.
3. Re-orient the loadings, by choosing one variable with a large coefficient to be the direction base.
4. If B=1000, the 25th and 975th sorted values of each loading yield the lower and upper bounds of its confidence interval, for each PC.
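The four steps might be sketched as below for the PC1 loadings. This is an illustration under stated assumptions, not the code behind the slides: it uses the built-in `USArrests` data as a stand-in, and the names `ref` and `boot_load` are hypothetical.

```r
set.seed(2020)

# Assumption: USArrests (built into R) is used purely as example data
X <- USArrests
B <- 1000
p <- ncol(X)

# Direction base: the variable with the largest absolute loading on PC1
base_pc <- prcomp(X, center = TRUE, scale. = TRUE)
ref <- which.max(abs(base_pc$rotation[, 1]))

boot_load <- matrix(NA_real_, nrow = B, ncol = p,
                    dimnames = list(NULL, colnames(X)))
for (b in seq_len(B)) {
  # Step 1: resample rows with replacement
  idx <- sample(nrow(X), replace = TRUE)
  # Step 2: recompute the PCA and keep the PC1 loadings
  l1 <- prcomp(X[idx, ], center = TRUE, scale. = TRUE)$rotation[, 1]
  # Step 3: flip the sign so the base variable always loads positively
  if (l1[ref] < 0) l1 <- -l1
  boot_load[b, ] <- l1
}

# Step 4: 25th and 975th sorted values give the 95% interval for each loading
ci <- apply(boot_load, 2, function(v) sort(v)[c(25, 975)])
rownames(ci) <- c("lower", "upper")
round(ci, 2)
```

A loading whose interval excludes 0 would be judged significantly different from 0. The sign flip in step 3 matters because the sign of a principal component is arbitrary and can change from one bootstrap sample to the next.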
All of the coefficients on PC1 are significantly different from 0, and positive. They are approximately equal, and not significantly different from each other.
Loadings for PC2
On PC2, m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These are significantly different from 0.
Loadings for PC3
On PC3, m400 and m800 (and possibly marathon) are significantly different from 0.
Interpretation
PC1 measures overall magnitude, the strength of the athletics program. High positive values indicate poor programs with generally slow times across events.
PC2 measures the contrast in the program between short and long distance events. Some countries have relatively stronger long distance athletes, while others have relatively stronger short distance athletes.
There are several outliers visible in this plot: wsamoa, cookis, dpkorea. PCA, because it is computed using the variance in the data, can be affected by outliers. It may be better to remove these countries, and re-run the PCA.
PC3 may or may not be useful to keep. The interpretation would be that this variable summarises countries with different middle distance performance.
Made by a human with a computer
Slides at https://iml.numbat.space.
Code and data at https://github.com/numbats/iml.
Created using R Markdown with flair by xaringan, and kunoichi (female ninja) style.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.