
ETC3250: Dimension reduction

Semester 1, 2020

Professor Di Cook

Econometrics and Business Statistics Monash University

Week 4 (a)


Space is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space.

Douglas Adams, Hitchhiker's Guide to the Galaxy


High Dimensional Data

Remember, our data can be denoted as:

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N, \quad \text{where } x_i = (x_{i1}, \dots, x_{ip})^T$$

then the dimension of the data is $p$, the number of variables.


Cubes and Spheres

Space expands exponentially with dimension:

As dimension increases, the volume of a sphere with the same radius as the cube's side length becomes much smaller than the volume of the cube.
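A quick numerical check of this claim, as a sketch only. It assumes the standard formula for the volume of a $p$-dimensional ball of radius $r$, $\pi^{p/2} r^p / \Gamma(p/2 + 1)$, and compares it with a cube of side length $r$ (volume $r^p$). The function name `ball_to_cube_ratio` is made up for this illustration.

```r
# Ratio of the volume of a p-dimensional ball (radius r) to the volume of a
# p-dimensional cube (side length r). The r^p factors cancel, so the ratio
# depends only on the dimension p.
ball_to_cube_ratio <- function(p) {
  pi^(p / 2) / gamma(p / 2 + 1)
}

p <- c(1, 2, 3, 5, 10, 20, 50)
ratio <- ball_to_cube_ratio(p)

# Tabulate the ratio against dimension
data.frame(p = p, ratio = signif(ratio, 3))
```

Under this formula the ratio grows at first (it peaks near $p = 5$) and then shrinks rapidly towards zero, which is the sense in which the sphere becomes much smaller than the cube in high dimensions.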


Examples of High Dimensional Data

High dimensional data occurs commonly in bioinformatics, where genetic studies often have much more information on genes than on patients.


Example - SRBCT cancer prediction

- The SRBCT dataset (Khan et al., 2001) looks at classifying 4 classes of different childhood tumours sharing similar visual features during routine histology.
- Data contains 83 microarray samples with 1586 features.
- We will revisit this data later on in the course to explore high dimensional Discriminant Analysis.

Source: Nature


Multivariate data

Mostly though, we're working on problems where $n \gg p$, and $p > 1$. This would more commonly be referred to as multivariate data.


Sub-spaces

Data will often be confined to a region of the space having lower intrinsic dimensionality. The data lives in a low-dimensional subspace.

Analyse the data by reducing dimensionality to the subspace containing the data.


Principal Component Analysis (PCA)

Principal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. It is an unsupervised learning method.


Why use PCA?

- We may have too many predictors for a regression. Instead, we can use the first few principal components.
- Understanding relationships between variables.
- Data visualization. We can plot a small number of variables more easily than a large number of variables.


First principal component

The first principal component of a set of variables $x_1, x_2, \dots, x_p$ is the linear combination

$$z_1 = \phi_{11}x_1 + \phi_{21}x_2 + \dots + \phi_{p1}x_p$$

that has the largest variance such that

$$\sum_{j=1}^p \phi_{j1}^2 = 1$$

The elements $\phi_{11}, \dots, \phi_{p1}$ are the loadings of the first principal component.
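The definition can be checked numerically. The sketch below uses simulated data (the variables `x1`, `x2`, `x3` are hypothetical, not from the course data) and compares the variance of the first PC, as returned by `prcomp()`, with the variance along many random unit-length directions.

```r
set.seed(1)
n <- 200

# Hypothetical data with p = 3 variables, two of them correlated
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X <- scale(cbind(x1, x2, x3), center = TRUE, scale = FALSE)

phi1 <- prcomp(X)$rotation[, 1]  # loadings of the first PC
z1 <- as.vector(X %*% phi1)      # z_1 = phi_11 x_1 + ... + phi_p1 x_p

# The constraint: squared loadings sum to 1
sum(phi1^2)

# Variance along 1000 random unit-length directions
random_var <- replicate(1000, {
  a <- rnorm(3)
  a <- a / sqrt(sum(a^2))        # normalise to unit length
  var(as.vector(X %*% a))
})

# The first PC should have at least as much variance as any random direction
c(first_pc = var(z1), best_random = max(random_var))
```

Random search is only a sanity check here; it is not how the loadings are computed.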


Geometry

- The loading vector $\phi_1 = [\phi_{11}, \dots, \phi_{p1}]'$ defines the direction in feature space along which the data vary most.
- If we project the $n$ data points $x_1, \dots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \dots, z_{n1}$.
- The second principal component is the linear combination $z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \dots + \phi_{p2}x_{ip}$ that has maximal variance among all linear combinations that are uncorrelated with $z_1$.
- Equivalent to constraining $\phi_2$ to be orthogonal (perpendicular) to $\phi_1$. And so on.
- There are at most $\min(n - 1, p)$ PCs.
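These geometric properties can be illustrated with a small simulation. The data below are made up; the sketch checks that the loading vectors are orthonormal, that the scores are uncorrelated, and that a data set with fewer observations than variables has at most $n - 1$ components with non-zero variance.

```r
set.seed(2)
n <- 100

# Hypothetical data with p = 4 variables, two of them correlated
X <- matrix(rnorm(n * 4), ncol = 4)
X[, 2] <- X[, 1] + 0.3 * X[, 2]

pca <- prcomp(X, center = TRUE)
Phi <- pca$rotation   # columns are the loading vectors phi_1, ..., phi_p
Z <- pca$x            # columns are the PC scores

round(crossprod(Phi), 10)  # Phi'Phi: identity, so loadings are orthonormal
round(cor(Z), 10)          # scores are mutually uncorrelated

# A wide data set: n = 5 observations, p = 10 variables
X_wide <- matrix(rnorm(5 * 10), nrow = 5)
pca_wide <- prcomp(X_wide, center = TRUE)

# Number of components with non-negligible variance: min(n - 1, p) = 4
sum(pca_wide$sdev > 1e-8)
```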


Example

[Figure: first PC; second PC]


Example

If you think of the first few PCs as a linear model fit, and the others as the error, it is like regression, except that the errors are orthogonal to the model.


Computation

PCA can be thought of as fitting a $p$-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The new variables produced by principal components correspond to rotating and scaling the ellipse into a circle.


Computation

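Suppose we have an $n \times p$ data set $X = [x_{ij}]$.

Centre each of the variables to have mean zero (i.e., the column means of $X$ are zero).

$$z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \dots + \phi_{p1}x_{ip}$$

Sample variance of $z_{i1}$ is $\frac{1}{n}\sum_{i=1}^n z_{i1}^2$.

$$\underset{\phi_{11},\dots,\phi_{p1}}{\text{maximize}} \; \frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^p \phi_{j1}x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1$$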
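The constrained maximisation above can be connected to an eigenvalue problem through a Lagrange multiplier. The following is a sketch of that standard argument, written in matrix form with $\phi_1 = [\phi_{11}, \dots, \phi_{p1}]'$. The multiplier $\lambda$ and the symbol $L$ are introduced here for illustration and do not appear on the slides.

```latex
\begin{aligned}
\text{Objective: } & \frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^p \phi_{j1}x_{ij}\Big)^2
    = \frac{1}{n}\,\phi_1' X'X \phi_1,
    \qquad \text{constraint: } \phi_1'\phi_1 = 1 \\
\text{Lagrangian: } & L(\phi_1, \lambda)
    = \frac{1}{n}\,\phi_1' X'X \phi_1 - \lambda\,(\phi_1'\phi_1 - 1) \\
\text{Stationarity: } & \frac{\partial L}{\partial \phi_1}
    = \frac{2}{n}\,X'X\phi_1 - 2\lambda\phi_1 = 0
    \;\Longrightarrow\; \frac{1}{n}\,X'X\phi_1 = \lambda\phi_1 \\
\text{Value at a solution: } & \frac{1}{n}\,\phi_1' X'X \phi_1
    = \lambda\,\phi_1'\phi_1 = \lambda
\end{aligned}
```

Under this argument, any solution $\phi_1$ is an eigenvector of $\frac{1}{n}X'X$, and the variance it attains equals the corresponding eigenvalue, so the first PC direction is the eigenvector with the largest eigenvalue.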


Computation

1. Compute the covariance matrix (after scaling the columns of $X$): $C = X'X$

2. Find eigenvalues and eigenvectors: $C = VDV'$, where columns of $V$ are orthonormal (i.e., $V'V = I$)

3. Compute PCs: $\Phi = V$. $Z = X\Phi$.
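The three steps can be sketched in R as follows, on simulated data (the variables are hypothetical). One assumption is added that is not on the slide: $X'X$ is divided by $n - 1$ so that the eigenvalues are on the variance scale used by `prcomp()`. This only rescales the eigenvalues and leaves the eigenvectors unchanged.

```r
set.seed(3)
n <- 150

# Hypothetical data with p = 3 variables
x1 <- rnorm(n)
X_raw <- cbind(x1 = x1,
               x2 = x1 + rnorm(n, sd = 0.5),
               x3 = rnorm(n, sd = 3))
X <- scale(X_raw, center = TRUE, scale = TRUE)  # centre and scale the columns

# Step 1: covariance matrix (X'X, rescaled by the assumed 1 / (n - 1))
C <- crossprod(X) / (n - 1)

# Step 2: eigen-decomposition C = V D V'
e <- eigen(C, symmetric = TRUE)
V <- e$vectors
D <- e$values

# Step 3: loadings and scores
Phi <- V
Z <- X %*% Phi

# Compare with prcomp(). Eigenvector signs are arbitrary, so compare
# absolute values of the loadings.
ref <- prcomp(X_raw, center = TRUE, scale. = TRUE)
max(abs(abs(Phi) - abs(ref$rotation)))  # should be near 0
max(abs(D - ref$sdev^2))                # should be near 0
```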


Singular Value Decomposition

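$$X = U\Lambda V'$$

- $X$ is an $n \times p$ matrix
- $U$ is an $n \times r$ matrix with orthonormal columns ($U'U = I$)
- $\Lambda$ is an $r \times r$ diagonal matrix with non-negative elements.
- $V$ is a $p \times r$ matrix with orthonormal columns ($V'V = I$).

It is always possible to decompose a matrix in this way. The singular values are unique, while the singular vectors are determined only up to sign (and only when the singular values are distinct).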
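A minimal sketch of these properties using base R's `svd()` on a made-up matrix. It checks the dimensions, the orthonormality of $U$ and $V$, and that the product reproduces $X$. It also shows that flipping the sign of a matching pair of columns in $U$ and $V$ gives an equally valid decomposition.

```r
set.seed(4)
X <- matrix(rnorm(20 * 3), nrow = 20)  # hypothetical data: n = 20, p = 3

s <- svd(X)
U <- s$u
Lambda <- diag(s$d)
V <- s$v

dim(U)                               # n x r
dim(V)                               # p x r
max(abs(crossprod(U) - diag(3)))     # U'U = I, up to rounding
max(abs(crossprod(V) - diag(3)))     # V'V = I, up to rounding
max(abs(X - U %*% Lambda %*% t(V)))  # X = U Lambda V', up to rounding
all(s$d >= 0)                        # singular values are non-negative

# Flip the sign of the first column of both U and V: still reproduces X
U2 <- U
V2 <- V
U2[, 1] <- -U2[, 1]
V2[, 1] <- -V2[, 1]
max(abs(X - U2 %*% Lambda %*% t(V2)))
```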


Computation

1. Compute SVD: $X = U\Lambda V'$.

2. Compute PCs: $\Phi = V$. $Z = X\Phi$.

Relationship with covariance:

$$C = X'X = V\Lambda U'U\Lambda V' = V\Lambda^2 V' = VDV'$$

- Eigenvalues of $C$ are squares of singular values of $X$.
- Eigenvectors of $C$ are right singular vectors of $X$.
- The PC directions $\phi_1, \phi_2, \phi_3, \dots$ are the right singular vectors of the matrix $X$.
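The SVD route and its link to the covariance matrix can be sketched as below, again on simulated data. The final comparison assumes that `prcomp()` reports standard deviations with an $n - 1$ divisor, so the singular values are divided by $\sqrt{n - 1}$ before comparing.

```r
set.seed(5)
n <- 80

# Hypothetical centred data with p = 4 variables of differing spread
X <- scale(matrix(rnorm(n * 4), ncol = 4) %*% diag(c(3, 2, 1, 0.5)),
           center = TRUE, scale = FALSE)

# Step 1: SVD of the centred data matrix
s <- svd(X)

# Step 2: loadings and scores
Phi <- s$v
Z <- X %*% Phi

# Scores can equivalently be written as U Lambda
max(abs(Z - s$u %*% diag(s$d)))

# Eigenvalues of C = X'X are the squared singular values of X
ev <- eigen(crossprod(X), symmetric = TRUE)$values
max(abs(ev - s$d^2))

# Standard deviations from prcomp() versus rescaled singular values
max(abs(prcomp(X)$sdev - s$d / sqrt(n - 1)))
```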


Total variance

Total variance in data (assuming variables centered at 0):

$$TV = \sum_{j=1}^p Var(x_j) = \sum_{j=1}^p \frac{1}{n}\sum_{i=1}^n x_{ij}^2$$

If variables are standardised, TV = number of variables!

Variance explained by m'th PC:

$$V_m = Var(z_m) = \frac{1}{n}\sum_{i=1}^n z_{im}^2$$

$$TV = \sum_{m=1}^M V_m \quad \text{where } M = \min(n - 1, p).$$


How to choose $k$?

PCA is a useful dimension reduction technique for large datasets, but deciding how many dimensions to keep often isn't clear. 🤔

Think: How do we know how many principal components to choose?


How to choose $k$?

Proportion of variance explained:

$$PVE_m = \frac{V_m}{TV}$$

Choosing the number of PCs that adequately summarises the variation in $X$ is achieved by examining the cumulative proportion of variance explained.

Cumulative proportion of variance explained:

$$CPVE_k = \sum_{m=1}^k \frac{V_m}{TV}$$


How to choose $k$?

Scree plot: Plot of variance explained by each component vs the component number.
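A scree plot can be drawn directly from the component variances. The sketch below uses the built-in `USArrests` data as a stand-in, since the course data are not loaded at this point; the choice of data set and the plot labels are assumptions made for illustration.

```r
# Stand-in data: USArrests ships with R (50 states, 4 variables)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance explained by each component
variances <- pca$sdev^2

# Scree plot: variance against component number
plot(seq_along(variances), variances, type = "b",
     xlab = "Principal component", ylab = "Variance explained",
     main = "Scree plot (USArrests, standardised)")
```

Base R also provides `screeplot(pca, type = "lines")`, which draws a similar display from a `prcomp` object.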


Example - track records

The data on national track records for women (as at 1984).

## Rows: 55
## Columns: 8
## $ m100     <dbl> 11.61, 11.20, 11.43, 11.41, 11.46, 11.31, 12.14, 11.00, 12.0…
## $ m200     <dbl> 22.94, 22.35, 23.09, 23.04, 23.05, 23.17, 24.47, 22.25, 24.5…
## $ m400     <dbl> 54.50, 51.08, 50.62, 52.00, 53.30, 52.80, 55.00, 50.06, 54.9…
## $ m800     <dbl> 2.15, 1.98, 1.99, 2.00, 2.16, 2.10, 2.18, 2.00, 2.05, 2.08, …
## $ m1500    <dbl> 4.43, 4.13, 4.22, 4.14, 4.58, 4.49, 4.45, 4.06, 4.23, 4.33, …
## $ m3000    <dbl> 9.79, 9.08, 9.34, 8.88, 9.81, 9.77, 9.51, 8.81, 9.37, 9.31, …
## $ marathon <dbl> 178.52, 152.37, 159.37, 157.85, 169.98, 168.75, 191.02, 149.…
## $ country  <chr> "argentin", "australi", "austria", "belgium", "bermuda", "br…

Source: Johnson and Wichern, Applied multivariate analysis


Explore the data


Compute PCA

# PCA on the seven event times, centred and scaled to unit variance
track_pca <- prcomp(track[, 1:7], center = TRUE, scale. = TRUE)
track_pca

## Standard deviations (1, .., p=7):
## [1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15
## 
## Rotation (n x k) = (7 x 7):
##            PC1   PC2    PC3    PC4    PC5     PC6    PC7
## m100      0.37  0.49 -0.286  0.319  0.231  0.6198  0.052
## m200      0.37  0.54 -0.230 -0.083  0.041 -0.7108 -0.109
## m400      0.38  0.25  0.515 -0.347 -0.572  0.1909  0.208
## m800      0.38 -0.16  0.585 -0.042  0.620 -0.0191 -0.315
## m1500     0.39 -0.36  0.013  0.430  0.030 -0.2312  0.693
## m3000     0.39 -0.35 -0.153  0.363 -0.463  0.0093 -0.598
## marathon  0.37 -0.37 -0.484 -0.672  0.131  0.1423  0.070


Assess

Summary of the principal components:

            PC1  PC2  PC3  PC4  PC5  PC6  PC7
Variance   5.81 0.65 0.30 0.13 0.05 0.04 0.02
Proportion 0.83 0.09 0.04 0.02 0.01 0.01 0.00
Cum. prop  0.83 0.92 0.97 0.98 0.99 1.00 1.00

The increase in variance explained is large until $k = 3$ PCs, and then tapers off. A choice of 3 PCs would explain 97% of the total variance.
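The summary rows can be rebuilt from the standard deviations printed in the `prcomp` output. This is a sketch using those rounded values, so individual entries may differ from the table in the last digit; the object name `smry` is made up here.

```r
# Standard deviations as printed (rounded) in the prcomp output
sdev <- c(2.41, 0.81, 0.55, 0.35, 0.23, 0.20, 0.15)

variance <- sdev^2                # V_m
pve <- variance / sum(variance)   # PVE_m = V_m / TV
cpve <- cumsum(pve)               # CPVE_k

smry <- rbind(Variance = variance, Proportion = pve, `Cum. prop` = cpve)
colnames(smry) <- paste0("PC", seq_along(sdev))
round(smry, 2)
```

With the fitted object available, `summary(track_pca)` reports the same quantities from the unrounded standard deviations.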


Assess

Scree plot: Where is the elbow?

At $k = 2$, so the scree plot suggests that 2 PCs would be sufficient to explain the variability.


Assess

Visualise model using a biplot: Plot the principal component scores, and also the contribution of the original variables to the principal component.
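A minimal biplot sketch using base R's `biplot()` method for `prcomp` objects. The built-in `USArrests` data stand in for the track data here, and this is not necessarily how the plot on the slide was produced.

```r
# Stand-in data: USArrests ships with R
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Points (labels) are the PC1/PC2 scores; arrows show the loadings of the
# original variables on these two components.
biplot(pca, scale = 0, cex = 0.6)
```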


Significance of loadings

Bootstrap can be used to assess whether the coefficients of a PC are significantly different from 0. The 95% bootstrap confidence intervals can be computed by:

1. Generate B bootstrap samples of the data
2. Compute PCA, record the loadings
3. Re-orient the loadings, by choosing one variable with a large coefficient to be the direction base
4. If B=1000, the 25th and 975th sorted values yield the lower and upper bounds of the confidence interval for each loading of the PC.
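The four steps can be sketched as follows for the PC1 loadings. This is an illustration under stated assumptions: the built-in `USArrests` data stand in for the track data, `Murder` is chosen as the reference variable for re-orientation, and the object names (`boot_pc1`, `ci`) are made up.

```r
set.seed(2020)
X <- USArrests   # stand-in data
B <- 1000
p <- ncol(X)

boot_pc1 <- matrix(NA_real_, nrow = B, ncol = p,
                   dimnames = list(NULL, colnames(X)))

for (b in seq_len(B)) {
  # Step 1: bootstrap sample of the rows (with replacement)
  idx <- sample(nrow(X), replace = TRUE)

  # Step 2: PCA on the bootstrap sample; keep the PC1 loadings
  phi <- prcomp(X[idx, ], center = TRUE, scale. = TRUE)$rotation[, 1]

  # Step 3: re-orient so the reference variable always has a positive sign
  if (phi["Murder"] < 0) phi <- -phi

  boot_pc1[b, ] <- phi
}

# Step 4: 25th and 975th sorted values give a 95% interval per loading
ci <- apply(boot_pc1, 2, function(v) sort(v)[c(25, 975)])
rownames(ci) <- c("lower", "upper")
ci
```

A loading whose interval excludes 0 would be read as significantly different from 0. The re-orientation step matters because the sign of a PC is arbitrary and can flip between bootstrap samples.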


All of the coefficients on PC1 are significantly different from 0, and positive. They are approximately equal, and not significantly different from being equal.


Loadings for PC2

On PC2, m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These are significantly different from 0.


Loadings for PC3

On PC3, m400 and m800 (and possibly marathon) are significantly different from 0.


Interpretation

- PC1 measures overall magnitude, the strength of the athletics program. High positive values indicate poor programs with generally slow times across events.
- PC2 measures the contrast in the program between short and long distance events. Some countries have relatively stronger long distance athletes, while others have relatively stronger short distance athletes.
- There are several outliers visible in this plot: wsamoa, cookis, dpkorea. PCA, because it is computed using the variance in the data, can be affected by outliers. It may be better to remove these countries, and re-run the PCA.
- PC3 may or may not be useful to keep. The interpretation would be that this variable summarises countries with different middle distance performance.


Made by a human with a computer

Slides at https://iml.numbat.space.

Code and data at https://github.com/numbats/iml.

Created using R Markdown with flair by xaringan, and kunoichi (female ninja) style.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
