

ETC3250: Dimension reduction
Semester 1, 2020

Professor Di Cook

Econometrics and Business Statistics Monash University

Week 4 (a)

Space is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space.

Douglas Adams, Hitchhiker's Guide to the Galaxy

2 / 36

High Dimensional Data

Remember, our data can be denoted as:

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N, \quad \text{where } x_i = (x_{i1}, \dots, x_{ip})^T,$$

then the dimension of the data is $p$, the number of variables.

3 / 36

Cubes and Spheres

Space expands exponentially with dimension:

As dimension increases, the volume of a sphere with radius equal to the cube's side length becomes much smaller than the volume of the cube.
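A quick numerical illustration of this in R (base R only; a sketch using the standard formula for the volume of a $p$-dimensional ball, $\pi^{p/2} r^p / \Gamma(p/2 + 1)$):

# Volume of a p-dimensional ball of radius r
ball_volume <- function(p, r = 1) pi^(p / 2) / gamma(p / 2 + 1) * r^p

# Volume of a p-dimensional cube with side length s
cube_volume <- function(p, s = 1) s^p

p <- c(1, 2, 5, 10, 20, 50)
round(ball_volume(p, r = 1) / cube_volume(p, s = 1), 10)
# The ratio collapses towards zero as p grows, even though the
# sphere's radius equals the cube's side length.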

4 / 36

Examples of High Dimensional Data

High dimensional data occurs commonly in bioinformatics, where genetic studies often have far more information on genes than on patients.

5 / 36

Example - SRBCT cancer prediction

The SRBCT dataset (Khan et al., 2001) looks at classifying 4 classes of different childhood tumours sharing similar visual features during routine histology. The data contains 83 microarray samples with 1586 features. We will revisit this data later in the course to explore high dimensional Discriminant Analysis. Source: Nature

6 / 36

Multivariate data

Mostly though, we're working on problems where $n >> p$, and $p > 1$. This would more commonly be referred to as multivariate data.

7 / 36

Sub-spaces

Data will often be confined to a region of the space having lower intrinsic dimensionality. The data lives in a low-dimensional subspace.

Analyse the data by reducing dimensionality to the subspace containing the data.

8 / 36

Principal Component Analysis (PCA)

Principal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. It is an unsupervised learning method.

9 / 36

Why use PCA?

We may have too many predictors for a regression. Instead, we can use the first few principal components.

Understanding relationships between variables.

Data visualization. We can plot a small number of variables more easily than a large number of variables.

10 / 36

First principal component

The first principal component of a set of variables $x_1, x_2, \dots, x_p$ is the linear combination

$$z_1 = \phi_{11}x_1 + \phi_{21}x_2 + \dots + \phi_{p1}x_p$$

that has the largest variance such that

$$\sum_{j=1}^p \phi_{j1}^2 = 1.$$

The elements $\phi_{11}, \dots, \phi_{p1}$ are the loadings of the first principal component.

11 / 36

Geometry

The loading vector $\phi_1 = [\phi_{11}, \dots, \phi_{p1}]'$ defines the direction in feature space along which the data vary most. If we project the $n$ data points $x_1, \dots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \dots, z_{n1}$.

The second principal component is the linear combination

$$z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \dots + \phi_{p2}x_{ip}$$

that has maximal variance among all linear combinations that are uncorrelated with $z_1$. This is equivalent to constraining $\phi_2$ to be orthogonal (perpendicular) to $\phi_1$. And so on. There are at most $\min(n - 1, p)$ PCs.
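A minimal sketch (with simulated data, not the course data) showing that the scores are just the centred data projected onto the loading vectors, and that successive PCs are uncorrelated:

set.seed(2020)
X <- matrix(rnorm(100 * 3), ncol = 3) %*% matrix(runif(9), 3, 3)  # correlated variables
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Scores = centred data projected onto the loading (rotation) vectors
Xc <- scale(X, center = TRUE, scale = FALSE)
z1 <- Xc %*% pca$rotation[, 1]
all.equal(as.vector(z1), as.vector(pca$x[, 1]))      # TRUE

cor(pca$x[, 1], pca$x[, 2])                          # ~0: scores are uncorrelated
sum(pca$rotation[, 1] * pca$rotation[, 2])           # ~0: loadings are orthogonal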

12 / 36

Example

First PC; second PC

13 / 36

Example

If you think of the first few PCs like a linear model fit, and the others as the error, it is like regression, except that the errors are orthogonal to the model.

14 / 36

Computation

PCA can be thought of as fitting an $n$-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The new variables produced by principal components correspond to rotating and scaling the ellipse into a circle.

15 / 36

Computation

Suppose we have an $n \times p$ data set $X = [x_{ij}]$.

Centre each of the variables to have mean zero (i.e., the column means of $X$ are zero).

Let $z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \dots + \phi_{p1}x_{ip}$. The sample variance of $z_{i1}$ is $\frac{1}{n}\sum_{i=1}^n z_{i1}^2$.

$$\underset{\phi_{11}, \dots, \phi_{p1}}{\text{maximize}} \; \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1$$
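A sketch of this variance-maximisation view in R (with simulated data, not the course data): the first loading vector returned by prcomp() gives a larger projected sample variance than random unit-norm directions.

set.seed(1)
X <- scale(matrix(rnorm(200 * 4), ncol = 4) %*% matrix(rnorm(16), 4, 4),
           center = TRUE, scale = FALSE)                 # centred data

# Sample variance (1/n times the sum of squares) of the projection onto a unit vector
proj_var <- function(phi) {
  phi <- phi / sqrt(sum(phi^2))                          # enforce the unit-norm constraint
  mean((X %*% phi)^2)
}

phi1 <- prcomp(X)$rotation[, 1]
proj_var(phi1)                                           # variance along the first PC direction
max(replicate(1000, proj_var(rnorm(4))))                 # random directions never do better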

16 / 36

Computation

1. Compute the covariance matrix (after scaling the columns of $X$)

$$C = X'X$$

2. Find the eigenvalues and eigenvectors:

$$C = VDV'$$

where the columns of $V$ are orthonormal (i.e., $V'V = I$).

3. Compute the PCs: $\Phi = V$, $Z = X\Phi$ (a sketch of these steps follows below).
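A minimal sketch of these steps in base R (simulated data; prcomp() uses the SVD internally, so the signs of individual columns can differ):

set.seed(1)
X <- scale(matrix(rnorm(50 * 4), ncol = 4))   # centre and scale the columns

C   <- t(X) %*% X        # covariance matrix (up to a constant), C = X'X
e   <- eigen(C)          # C = V D V'
Phi <- e$vectors         # loadings
Z   <- X %*% Phi         # principal component scores

# Same directions as prcomp(), up to the sign of each column
p <- prcomp(X, center = FALSE, scale. = FALSE)
max(abs(abs(Phi) - abs(p$rotation)))          # ~0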

17 / 36

Singular Value Decomposition

$$X = U\Lambda V'$$

$X$ is an $n \times p$ matrix.
$U$ is an $n \times r$ matrix with orthonormal columns ($U'U = I$).
$\Lambda$ is an $r \times r$ diagonal matrix with non-negative elements.
$V$ is a $p \times r$ matrix with orthonormal columns ($V'V = I$).

It is always possible to uniquely decompose a matrix in this way.

18 / 36

Computation

1. Compute the SVD: $X = U\Lambda V'$.

2. Compute the PCs: $\Phi = V$, $Z = X\Phi$.

Relationship with the covariance matrix:

$$C = X'X = V\Lambda U'U\Lambda V' = V\Lambda^2 V' = VDV'$$

The eigenvalues of $C$ are the squares of the singular values of $X$. The eigenvectors of $C$ are the right singular vectors of $X$. The PC directions $\phi_1, \phi_2, \phi_3, \dots$ are the right singular vectors of the matrix $X$ (see the sketch below).
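A sketch of the SVD route (again with simulated data), checking the relationship between the singular values of $X$ and the eigenvalues of $C$:

set.seed(1)
X <- scale(matrix(rnorm(50 * 4), ncol = 4))   # centred (and scaled) data

s   <- svd(X)            # X = U %*% diag(d) %*% t(V)
Phi <- s$v               # PC directions = right singular vectors
Z   <- X %*% Phi         # PC scores

# Eigenvalues of C = X'X are the squared singular values of X
C <- t(X) %*% X
all.equal(eigen(C)$values, s$d^2)             # TRUE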

19 / 36

Total variance

Total variance in the data (assuming the variables are centred at 0):

$$\text{TV} = \sum_{j=1}^p \text{Var}(x_j) = \frac{1}{n} \sum_{j=1}^p \sum_{i=1}^n x_{ij}^2$$

If the variables are standardised, TV = the number of variables!

Variance explained by the $m$th PC:

$$V_m = \text{Var}(z_m) = \frac{1}{n} \sum_{i=1}^n z_{im}^2$$

$$\text{TV} = \sum_{m=1}^M V_m, \quad \text{where } M = \min(n - 1, p).$$
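A quick numerical check of these identities (simulated data; note that prcomp() and var() use an $n-1$ divisor rather than the $1/n$ above, which does not change the identities):

set.seed(1)
X   <- scale(matrix(rnorm(40 * 5), ncol = 5))   # standardise 5 variables
pca <- prcomp(X)

V <- pca$sdev^2                    # variance explained by each PC
sum(V)                             # total variance = 5, the number of variables
sum(apply(X, 2, var))              # same total variance, computed from the columns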

20 / 36

How to choose $k$?

PCA is a useful dimension reduction technique for large datasets, but deciding how many dimensions to keep isn't always clear. 🤔

Think: How do we know how many principal components to choose?

21 / 36

How to choose $k$?

Proportion of variance explained:

$$\text{PVE}_m = \frac{V_m}{\text{TV}}$$

Choosing the number of PCs that adequately summarises the variation in $X$ is achieved by examining the cumulative proportion of variance explained.

Cumulative proportion of variance explained:

$$\text{CPVE}_k = \frac{\sum_{m=1}^k V_m}{\text{TV}}$$

22 / 36

How to choose $k$?

Scree plot: Plot of the variance explained by each component versus the number of components.
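One way to draw a scree plot in base R (a sketch with simulated data; the slide itself uses its own ggplot figure):

set.seed(1)
pca_fit <- prcomp(matrix(rnorm(100 * 6), ncol = 6) %*% matrix(rnorm(36), 6, 6),
                  scale. = TRUE)

# Base R scree plot of the component variances
screeplot(pca_fit, type = "lines", main = "Scree plot")

# Equivalent: plot the proportion of variance explained by each component
pve <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
plot(pve, type = "b", xlab = "Component", ylab = "Proportion of variance explained")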

23 / 36

How to choose $k$?

Scree plot: Plot of the variance explained by each component versus the number of components.

24 / 36

Example - track records

The data on national track records for women (as at 1984).

## Rows: 55
## Columns: 8
## $ m100     <dbl> 11.61, 11.20, 11.43, 11.41, 11.46, 11.31, 12.14, 11.00, 12.0…
## $ m200     <dbl> 22.94, 22.35, 23.09, 23.04, 23.05, 23.17, 24.47, 22.25, 24.5…
## $ m400     <dbl> 54.50, 51.08, 50.62, 52.00, 53.30, 52.80, 55.00, 50.06, 54.9…
## $ m800     <dbl> 2.15, 1.98, 1.99, 2.00, 2.16, 2.10, 2.18, 2.00, 2.05, 2.08, …
## $ m1500    <dbl> 4.43, 4.13, 4.22, 4.14, 4.58, 4.49, 4.45, 4.06, 4.23, 4.33, …
## $ m3000    <dbl> 9.79, 9.08, 9.34, 8.88, 9.81, 9.77, 9.51, 8.81, 9.37, 9.31, …
## $ marathon <dbl> 178.52, 152.37, 159.37, 157.85, 169.98, 168.75, 191.02, 149.…
## $ country  <chr> "argentin", "australi", "austria", "belgium", "bermuda", "br…

Source: Johnson and Wichern, Applied multivariate analysis

25 / 36

Explore the data

26 / 36

Compute PCA

track_pca <- prcomp(track[,1:7], center=TRUE, scale=TRUE)
track_pca

## Standard deviations (1, .., p=7):
## [1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15
##
## Rotation (n x k) = (7 x 7):
##           PC1   PC2    PC3    PC4    PC5     PC6    PC7
## m100     0.37  0.49 -0.286  0.319  0.231  0.6198  0.052
## m200     0.37  0.54 -0.230 -0.083  0.041 -0.7108 -0.109
## m400     0.38  0.25  0.515 -0.347 -0.572  0.1909  0.208
## m800     0.38 -0.16  0.585 -0.042  0.620 -0.0191 -0.315
## m1500    0.39 -0.36  0.013  0.430  0.030 -0.2312  0.693
## m3000    0.39 -0.35 -0.153  0.363 -0.463  0.0093 -0.598
## marathon 0.37 -0.37 -0.484 -0.672  0.131  0.1423  0.070

27 / 36

Assess

Summary of the principal components:

             PC1   PC2   PC3   PC4   PC5   PC6   PC7
Variance    5.81  0.65  0.30  0.13  0.05  0.04  0.02
Proportion  0.83  0.09  0.04  0.02  0.01  0.01  0.00
Cum. prop   0.83  0.92  0.97  0.98  0.99  1.00  1.00

The increase in variance explained is large until $k = 3$ PCs, and then tapers off. A choice of 3 PCs would explain 97% of the total variance.
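This summary can be reproduced directly from the track_pca object computed on the earlier slide (a sketch, assuming track_pca is still in the workspace):

summary(track_pca)        # standard deviations, proportion and cumulative proportion

# Or built by hand from the component variances
V <- track_pca$sdev^2
round(rbind(Variance     = V,
            Proportion   = V / sum(V),
            `Cum. prop.` = cumsum(V) / sum(V)), 2)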

28 / 36

Assess

Scree plot: Where is the elbow?

The elbow is at $k = 2$, thus the scree plot suggests 2 PCs would be sufficient to explain the variability.

29 / 36

Assess

Visualise the model using a biplot: plot the principal component scores, and also the contribution of the original variables to the principal components.
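Base R can draw a biplot directly from the prcomp object (a sketch; the slide itself uses a ggplot-based figure, so this is only an approximation of it):

# Scores (points) and variable contributions (arrows) for the first two PCs
biplot(track_pca, choices = 1:2, cex = 0.7)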

30 / 36

Significance of loadings

Bootstrap can be used to assess whether the coefficients of a PC are significantly different from 0. The 95% bootstrap confidence intervals can be computed by (see the sketch below):

1. Generating B bootstrap samples of the data.
2. Computing the PCA, and recording the loadings.
3. Re-orienting the loadings, by choosing one variable with a large coefficient to be the direction base.
4. If B = 1000, the 25th and 975th sorted values yield the lower and upper bounds of the confidence interval for each PC.
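A minimal sketch of this procedure for the track data (assumes the track data frame from the earlier slides; re-orientation here flips each replicate so that its m100 loading is positive, one way of choosing a direction base, and quantile() stands in for picking the 25th and 975th sorted values):

set.seed(2020)
B <- 1000
boot_loadings <- t(replicate(B, {
  idx  <- sample(nrow(track), replace = TRUE)                      # 1. bootstrap sample
  phi1 <- prcomp(track[idx, 1:7], scale. = TRUE)$rotation[, 1]     # 2. PC1 loadings
  if (phi1["m100"] < 0) phi1 <- -phi1                              # 3. re-orient
  phi1
}))

# 4. 95% bootstrap confidence interval for each loading on PC1
apply(boot_loadings, 2, quantile, probs = c(0.025, 0.975))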

31 / 36

Loadings for PC1

All of the coefficients on PC1 are significantly different from 0, and positive, approximately equal, and not significantly different from being equal.

32 / 36

Loadings for PC2

On PC2, m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These are significantly different from 0.

33 / 36

Loadings for PC3

On PC3, m400 and m800 (and possibly marathon) are significantly different from 0.

34 / 36

Interpretation

PC1 measures overall magnitude, the strength of the athletics program. High positive values indicate poor programs with generally slow times across events.

PC2 measures the contrast in the program between short and long distance events. Some countries have relatively stronger long distance athletes, while others have relatively stronger short distance athletes.

There are several outliers visible in this plot: wsamoa, cookis, dpkorea. PCA, because it is computed using the variance in the data, can be affected by outliers. It may be better to remove these countries, and re-run the PCA.

PC3 may or may not be useful to keep. The interpretation would be that this variable summarises countries with different middle distance performance.

35 / 36

Made by a human with a computer. Slides at https://iml.numbat.space.

Code and data at https://github.com/numbats/iml.

Created using R Markdown with flair by xaringan, and kunoichi (female ninja) style.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

36 / 36
