
eNote 2

Principal Component Analysis, PCA, in R

Contents

2 Principal Component Analysis, PCA, in R

2.1 Reading about PCA

2.2 Example: Fisher's Iris Data

2.2.1 Data import

2.2.2 Basic explorative analysis

2.2.3 PCA of Iris data

2.3 Spectral data example: yarn data

2.4 Exercises

2.1 Reading about PCA

You can use the Wehrens book, Chapter 4, pp. 43-56:

http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/page/1

and/or (probably better) the Varmuza book, Chapter 3, Sections 3.1-3.7:

http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2

The two R packages chemometrics and ChemometricsWithR are companions to these two books (chemometrics to the Varmuza book, ChemometricsWithR to the Wehrens book).

A tutorial review is also available: Bro and Smilde (2014): Principal Component Analysis. Analytical Methods (Tutorial Review), 6, 2812. http://pubs.rsc.org/en/content/articlepdf/2014/ay/c3ay41907j

Below, a number of important plots are exemplified as part of the iris example:

1. Variance plots ("scree-type" plots)

2. Scores, loadings and biplots (the main plots for interpreting structure)

3. Explained variances for each variable

4. Validation/diagnostics plots:

(a) Leverage and residuals (also called "score distances" and "orthogonal distances", cf. the nice Figure 3.15, page 79 in the Varmuza book)

(b) The "influence plot": residuals versus leverage

5. Jackknifing/bootstrapping/cross-validating the PCA for various purposes:

(a) Deciding on the number of components

(b) Sensitivity/uncertainty investigation of scores and loadings

What is PCA? It was developed by Karl Pearson in 1901:

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine (6) 2: 559-572.

Depending on the field, PCA is also known as (or computed via):

• Singular value decomposition

• Karhunen-Loeve expansion

• Eigenvector analysis

• Latent vector analysis

• Characteristic vector analysis

PCA is used for many things:

• Projection method

• Exploratory data analysis

• Extract information and remove noise

• Reduce dimensionality / Compression

• (Clustering)

And can be described/expressed in many ways:

• Produces optimal low-dimensional plots of observations (scores)

• Provides an overview of the variable correlation structure (loadings)

• Finds linear combinations of maximal variance

• Orthogonal distance regression method

• A bilinear model for the data

As a bilinear model, PCA can be written as follows, where $X$ is the (centered and scaled) $n \times p$ data matrix:

X = Observation Scores × Variable Loadings + Error

$$X = T P^{T} + E$$

$$X_{ij} = \sum_{a=1}^{A} t_{ia}\, p_{aj} + e_{ij}$$
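The bilinear model can be tried out numerically. The following is a minimal sketch in base R, assuming the built-in iris data and prcomp (rather than the PCA function used later in this note); the choice A = 2 and the object names T_A, P_A and E are illustrative only.

```r
# Bilinear PCA model X = T P^T + E on the standardized iris measurements
X <- scale(iris[, 1:4])                      # centered and scaled n x p matrix
pca <- prcomp(X, center = FALSE, scale. = FALSE)

A <- 2                                       # number of components kept (arbitrary choice)
T_A <- pca$x[, 1:A, drop = FALSE]            # scores, n x A
P_A <- pca$rotation[, 1:A, drop = FALSE]     # loadings, p x A
E <- X - T_A %*% t(P_A)                      # residual matrix

# Share of the total sum of squares that is left in E
resid_share <- sum(E^2) / sum(X^2)
print(round(resid_share, 3))

# With all p components the reconstruction is exact (up to rounding)
E_full <- X - pca$x %*% t(pca$rotation)
print(max(abs(E_full)))
```

The residual share is simply one minus the proportion of variance explained by the A components kept.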

Computations/a bit of math:

• PCA finds the linear combinations of the X-variables (components) with maximal variance:

$$\max_{\|\alpha\| = 1} \operatorname{Var}(X\alpha)$$

• PCA is the least squares fit of the bilinear (non-linear regression) model:

$$\min_{t,\,p} \sum_{i,j} \Big( x_{ij} - \sum_{a=1}^{A} t_{ia}\, p_{aj} \Big)^{2}$$

• PCA is the eigen decomposition of $X^{T}X$

• PCA is the eigen decomposition of $XX^{T}$

• PCA is the outcome of (a version of) the NIPALS algorithm
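The equivalence of these computational routes can be checked numerically. The sketch below uses base R only and the built-in iris data; it compares the two eigen decompositions, the singular value decomposition and prcomp. NIPALS is not included, and the object names are illustrative.

```r
X <- scale(iris[, 1:4])                # centered and scaled, n = 150, p = 4
n <- nrow(X)

# (1) Eigen decomposition of X^T X (p x p): the eigenvectors are the loadings
eig <- eigen(crossprod(X))
# (2) Eigen decomposition of X X^T (n x n): same non-zero eigenvalues
eig_big <- eigen(tcrossprod(X), symmetric = TRUE)
# (3) Singular value decomposition X = U D V^T
sv <- svd(X)
# (4) prcomp, which is based on the SVD
pca <- prcomp(X, center = FALSE)

# Component variances from each route (sums of squares divided by n - 1)
var_eig <- eig$values / (n - 1)
var_big <- eig_big$values[1:4] / (n - 1)
var_svd <- sv$d^2 / (n - 1)
var_pca <- pca$sdev^2
print(round(rbind(var_eig, var_big, var_svd, var_pca), 4))

# The loadings agree up to the sign of each column
print(max(abs(abs(eig$vectors) - abs(pca$rotation))))
```

The sign of a component is arbitrary, which is why the loadings are compared in absolute value.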

2.2 Example: Fisher’s Iris Data

Below there will be an exercise based on these data, with some questions that PCA can help answer. Here we exemplify a number of visualizations that one could make for such data, including PCA-based ones.

The Fisher Iris data set is a classic, cf.:

• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.

• Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5.

There are 150 objects: 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measured with a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW), so there are only four variables in all.

The original hypothesis was that I. versicolor was a hybrid of the two other species, i.e. I. setosa x virginica. I. setosa is diploid, I. virginica is tetraploid, and I. versicolor is hexaploid.

2.2.1 Data import

The iris data are already available within R, so no import is needed:

# Loading the package related to the Wehrens book

# (The first time, you need to install the package)

library(ChemometricsWithRData)

library(ChemometricsWithR)

data(iris)

Alternatively, read the iris CSV data, which is a copy of the file uploaded on CampusNet. Note that the iris data given on CampusNet are slightly different from the iris data available in R. First save the data set on your computer and set the relevant working directory in R, e.g. by clicking 'Session' and choosing 'Set working directory', or run the following command with the correct folder path:

setwd("C:/myfolderpath")

And then import the data into R as follows:

JCFiris=read.table("Fisher_JCF.csv",header=T,sep=";",dec=",")

Note that the Iris data given by JCF is slightly different from the IRIS data available in R:

summary(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width

Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1

1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3

Median :5.80 Median :3.00 Median :4.35 Median :1.3

Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2

3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8

Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5

Species

setosa :50

versicolor:50

virginica :50

summary(JCFiris)

X PW PL SW

setosa :50 Min. : 1.0 Min. :10.0 Min. :20.0

versicolor:50 1st Qu.: 3.0 1st Qu.:16.0 1st Qu.:28.0

virginica :50 Median :13.0 Median :44.0 Median :30.0

Mean :11.9 Mean :37.8 Mean :30.6

3rd Qu.:18.0 3rd Qu.:51.0 3rd Qu.:33.0

Max. :25.0 Max. :69.0 Max. :44.0

SL

Min. : 43.0

1st Qu.: 51.0

Median : 58.0

Mean : 62.6

3rd Qu.: 64.0

Max. :699.0

Note the differences: the names, the order and the scales of the variables. In addition, the JCF version contains an outlier (the maximum of SL is 699) that is not present in the R version. Look at the first 6 observations:

head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

head(JCFiris)

X PW PL SW SL

1 setosa 2 14 33 50

2 virginica 24 56 31 67

3 virginica 23 51 31 69

4 setosa 2 10 36 46

5 virginica 20 52 30 65

6 virginica 19 51 27 58

The dimensions are the same:

dim(iris)

[1] 150 5

dim(JCFiris)

[1] 150 5

2.2.2 Basic explorative analysis

First we do some classic (univariate) explorative analysis:

# 4 boxplots with color:

par(mar=c(4,2,3,2),mfrow=c(2,2))

for (i in 1:4) boxplot(iris[,i] ~ iris[,5],

col = 1:3, main = names(iris)[i])

[Figure: boxplots of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width by species (setosa, versicolor, virginica).]

The par(mar=c(4,2,3,2)) command controls the four margins of each individual plot in the order: bottom, left, top, right. This is helpful for making nice multi-plot pages.

# Pairwise scatters:

pairs(iris,col = iris$Species)

[Figure: scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, coloured by species.]

Let us, for the record, have a look at the covariance matrix:

cov(iris[,1:4])

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         0.69       -0.04         1.27        0.52
Sepal.Width         -0.04        0.19        -0.33       -0.12
Petal.Length         1.27       -0.33         3.12        1.30
Petal.Width          0.52       -0.12         1.30        0.58

And similarly the correlation matrix:

cor(iris[,1:4])

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

2.2.3 PCA of Iris data

First we do a basic PCA on covariances (WITHOUT standardization, ONLY with centering), here using the PCA function of the ChemometricsWithR package:

irisPC_without=PCA(scale(iris[,1:4], scale = FALSE))

Note that the scale-function is used here to just center the four variables.

# A good selection of 4 core plots:

scoreplot(irisPC_without, col = iris$Species, main = "Scores")

loadingplot(irisPC_without, show.names = TRUE, main = "Loadings")

biplot(irisPC_without, score.col = iris$Species, main = "biplot")

screeplot(irisPC_without, type = "percentage", main = "Explained variance")

[Figure: four panels for the PCA without standardization: Scores, Loadings (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), biplot, and Explained variance. PC 1 explains 92.5%.]

And now the PCA on correlations (WITH Standardization - AND with centering):

irisPC <- PCA(scale(iris[,1:4]))

Note that the scale function is now used to both center and standardize the four variables, which is the default behaviour of this function.

scoreplot(irisPC, col = iris$Species, main = "Scores")

loadingplot(irisPC, show.names = TRUE, main = "Loadings")

biplot(irisPC, score.col = iris$Species, main = "biplot")

screeplot(irisPC, type = "percentage", main = "Explained variance")

[Figure: four panels for the PCA with standardization: Scores, Loadings, biplot, and Explained variance. PC 1 explains 73.0%.]
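The effect of standardization on the explained variances (92.5% versus 73.0% for PC 1) can be reproduced without any add-on package. The following is a sketch using prcomp from base R instead of the PCA function above; the helper explained() is hypothetical and introduced only for this illustration.

```r
pca_cov <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)  # covariance-based
pca_cor <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)   # correlation-based

# Hypothetical helper: explained variance in percent for each component
explained <- function(p) 100 * p$sdev^2 / sum(p$sdev^2)
print(round(rbind(covariance = explained(pca_cov),
                  correlation = explained(pca_cor)), 1))

# Variances that enter the covariance-based PCA
print(round(apply(iris[, 1:4], 2, var), 2))

# PC1 loadings for the two versions (the sign of a component is arbitrary)
print(round(cbind(covariance = pca_cov$rotation[, 1],
                  correlation = pca_cor$rotation[, 1]), 2))
```

Without standardization, the variable with the largest variance (Petal.Length) tends to dominate the first component, which is one reason PC 1 explains more in the covariance-based analysis.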

There can be other versions of the variance plot, e.g.:

par(mfrow=c(1,2))

plot(1:length(irisPC$var), irisPC$var, cex = 2,

ylab = "variance explained",xlab = "n PC")

lines(1:length(irisPC$var), irisPC$var)

plot(1:length(irisPC$var), irisPC$var/sum(irisPC$var), cex = 2,

ylab = "(explained variance)/(total variance)",xlab = "n PC")

lines(1:length(irisPC$var), irisPC$var/sum(irisPC$var))

[Figure: variance explained, and (explained variance)/(total variance), plotted against the number of PCs.]
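A cumulative version of the variance plot is often used when choosing the number of components. The sketch below uses prcomp from base R; the helper n_components() and the 95% target are hypothetical choices made only for illustration.

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)
prop <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
cum_prop <- cumsum(prop)               # cumulative proportion
print(round(rbind(prop, cum_prop), 3))

# Hypothetical helper: smallest number of components reaching a target share
n_components <- function(cum_prop, target) which(cum_prop >= target)[1]
print(n_components(cum_prop, 0.95))

# Cumulative variance plot
plot(seq_along(cum_prop), cum_prop, type = "b", ylim = c(0, 1),
     xlab = "n PC", ylab = "cumulative (explained variance)/(total variance)")
```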

It can be useful to plot more components than just the first two:

# Scores:

pairs(scores(irisPC), col = iris$Species)

[Figure: scatter plot matrix of the PC scores, coloured by species.]

# Loadings:

par(mfrow = c(4,4), mar = c(4,4,.1,.1))

for (i in 1:4) for (j in 1:4) loadingplot(irisPC,

show.names = TRUE,pc=c(i,j), cex.lab=0.7)

[Figure: 4 x 4 grid of loading plots for all pairs of PC 1 (73.0%), PC 2 (22.9%), PC 3 (3.7%) and PC 4 (0.5%).]

A much nicer biplot can be created with the ggbiplot package (now using the prcomp function to do the PCA):

ir.pca <- prcomp(iris[,1:4],

center = TRUE,

scale. = TRUE)

library(devtools)

# First time install: install_github("vqv/ggbiplot")

library(ggbiplot)

g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1,

groups = iris[,5], ellipse = TRUE,

circle = FALSE)

print(g)

[Figure: ggbiplot of the iris PCA with an ellipse for each group (setosa, versicolor, virginica); horizontal axis PC1 (73.0% explained var.).]

Generally about interpreting PCA plots:

• Look at the variances (scree plot): hope for few components (e.g. 2) and look for the "bend"

• Look at scores and loadings (e.g. in a biplot)

– Scores: OBSERVATION mapping

∗ preserves inter-observation distances (as well as possible)

– Loadings: VARIABLE mapping (correlation structure)

∗ Variables in the SAME DIRECTION from (0,0) AND far away from (0,0) are highly correlated

– Loadings tell us on which variables the observations differ

– An observation to the right has high values on the variables with (large) loadings to the right

– An observation to the left has low values on the variables with (large) loadings to the right

• Look at residuals (orthogonal distances) and leverages (score distances) (outliers etc.)
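The statement that variables lying in the same direction and far from (0,0) are highly correlated can be illustrated numerically. In the sketch below (base R, prcomp, standardized iris data), the loadings are scaled by the component standard deviations; the inner products of the resulting two-dimensional vectors then approximate the correlation matrix. The scaling and the object names are choices made for this illustration.

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)
A <- 2                                   # the two components shown in a loading plot

# Loadings scaled by the component standard deviations
L <- pca$rotation[, 1:A] %*% diag(pca$sdev[1:A])

R_approx <- L %*% t(L)                   # correlation structure seen in two dimensions
R_true <- cor(iris[, 1:4])               # actual correlation matrix

print(round(R_approx, 2))
print(round(R_true, 2))
print(round(max(abs(R_approx - R_true)), 3))
```

The approximation is good here because the first two components carry most of the variance; with more variance left in later components, a two-dimensional loading plot can be misleading.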

Finally, let us show some of the diagnostics (residuals) plotting. For this we will use the chemometrics package (and now the princomp function for the PCA):

library(chemometrics)

irisPCA <- princomp(iris[,1:4], cor = TRUE)

# The score distances res$SDist express the leverage values

# The orthogonal distances express the residuals

## Plots vs object number :

res <- pcaDiagplot(iris[,1:4], irisPCA, a = 2)

[Figure: score distances and orthogonal distances plotted against object number.]
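To make the two diagnostics concrete, they can be computed by hand. The sketch below uses one common definition (score distance as a variance-scaled distance within the model plane, orthogonal distance as the length of the residual vector) with prcomp on the standardized data. Whether pcaDiagplot uses exactly the same scaling conventions has not been verified here, so the numbers need not match res$SDist and res$ODist exactly.

```r
X <- scale(iris[, 1:4])
pca <- prcomp(X, center = FALSE)
A <- 2
scores <- pca$x[, 1:A, drop = FALSE]
P <- pca$rotation[, 1:A, drop = FALSE]

# Score distance: distance inside the model plane, scaled by component variances
SD <- sqrt(rowSums(sweep(scores^2, 2, pca$sdev[1:A]^2, "/")))
# Orthogonal distance: Euclidean length of each residual row
OD <- sqrt(rowSums((X - scores %*% t(P))^2))

# Row numbers of the five most extreme observations on each measure
print(order(SD, decreasing = TRUE)[1:5])
print(order(OD, decreasing = TRUE)[1:5])

# Influence-type plot: residuals versus leverage
plot(SD, OD, type = "n", xlab = "Score distance", ylab = "Orthogonal distance")
text(SD, OD, labels = seq_along(SD))
```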

## Plot of the two against each other:

par(mfrow=c(1,2))

plot(res$SDist, res$ODist, type = "n")

text(res$SDist, res$ODist, labels = row.names(iris))

## Explained variance for each variable

pcaVarexpl(iris[,1:4],a=2)

[Figure: orthogonal distance versus score distance (res$SDist) with observation numbers as labels, and the explained variance for each variable.]

# Influence plot: residuals versus leverage

# for different number of components:

par(mfrow=c(2,2))

for (i in 1:4) {
  res <- pcaDiagplot(iris[,1:4], irisPCA, a = i, plot = FALSE)
  plot(res$SDist, res$ODist, type = "n")
  text(res$SDist, res$ODist, labels = row.names(iris))
}

[Figure: four influence plots (orthogonal distance versus res$SDist) for 1 to 4 components, with observation numbers as labels.]

Finally, let us indicate how one could do some 're-sampling' (similar to "jackknifing"): leave out a certain number of the observations and plot the loadings and/or scores for each subset of the data. First the loadings:

# Random samples of a certain proportion of the

# original number of observations are left out

par(mar = c(1,1,1,1), mfrow = c(3,3))
n <- nrow(iris)
leave_out_size <- 0.50
for (k in 1:9) {
  subsample <- sample(1:n, round(n*(1-leave_out_size)))
  # A separate object name keeps the full-data irisPC intact
  irisPCsub <- PCA(scale(iris[subsample,1:4]))
  loadingplot(irisPCsub, show.names = TRUE, main = "Loadings")
}

[Figure: 3 x 3 grid of loading plots, one per random subsample.]

Then the scores:

par(mar = c(1,1,1,1), mfrow = c(3,3))
for (k in 1:9) {
  subsample <- sample(1:n, round(n*(1-leave_out_size)))
  # A separate object name keeps the full-data irisPC intact
  irisPCsub <- PCA(scale(iris[subsample,1:4]))
  scoreplot(irisPCsub, col = iris$Species[subsample], main = "Scores")
}

[Figure: 3 x 3 grid of score plots, one per random subsample, coloured by species.]

The choice of showing 9 is arbitrary. Other plots of this re-sampling type could be thought of.
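One such alternative is to summarize the variability of the loadings numerically instead of comparing plots by eye. The sketch below is an assumption-laden illustration in base R (prcomp instead of PCA); the seed, the number of subsamples B and the object names are arbitrary. Because the sign of a component is arbitrary, each subsample loading vector is aligned with the full-data loadings before summarizing.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- nrow(iris)
leave_out_size <- 0.50
B <- 200                         # number of subsamples (arbitrary)

# Full-data PC1 loadings used as the reference direction
ref <- prcomp(iris[, 1:4], scale. = TRUE)$rotation[, 1]

pc1 <- replicate(B, {
  keep <- sample(1:n, round(n * (1 - leave_out_size)))
  p <- prcomp(iris[keep, 1:4], scale. = TRUE)$rotation[, 1]
  # Flip the sign if the subsample component points the opposite way
  if (sum(p * ref) < 0) -p else p
})

# Mean and standard deviation of each PC1 loading over the subsamples
print(round(cbind(full = ref,
                  mean = rowMeans(pc1),
                  sd = apply(pc1, 1, sd)), 3))
```

Small standard deviations relative to the size of the loadings suggest that the first component is stable under this kind of resampling.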

2.3 Spectral data example: yarn data

## Spectral data
data(yarn) # Provided by the pls package
# Try: ?yarn
dim(yarn$NIR)
## [1] 28 268

par(mfrow = c(2, 2), mar = c(4, 4, .2, .2))

# Plotting the first 21 individual NIR spectra
max_X <- max(yarn$NIR)
min_X <- min(yarn$NIR)
plot(yarn$NIR[1,], type = "n", ylim = c(min_X, max_X))
for (i in 1:21) lines(yarn$NIR[i,], col = i)

# Plotting the first 21 individual NIR spectra, centered
max_X <- max(scale(yarn$NIR, scale = FALSE))
min_X <- min(scale(yarn$NIR, scale = FALSE))
plot(scale(yarn$NIR[1,], scale = FALSE), type = "n", ylim = c(min_X, max_X))
for (i in 1:21) lines(scale(yarn$NIR, scale = FALSE)[i,], col = i)

# Plotting the first 21 individual NIR spectra, centered and scaled
max_X <- max(scale(yarn$NIR))
min_X <- min(scale(yarn$NIR))
plot(scale(yarn$NIR[1,]), type = "n", ylim = c(min_X, max_X))
for (i in 1:21) lines(scale(yarn$NIR)[i,], col = i)

# Plotting the principal variances
yarnPC <- PCA(scale(yarn$NIR))
plot(1:length(yarnPC$var), yarnPC$var, cex = 2)
lines(1:length(yarnPC$var), yarnPC$var)

[Figure: the NIR spectra (raw, centered, and centered and scaled) and the principal variances plotted against 1:length(yarnPC$var).]

# Plot of y:

plot(yarn$density,type="n")

lines(yarn$density)

[Figure: yarn density plotted against sample number.]

2.4 Exercises

Exercise 1 Fisher’s Iris data

First examine the raw data and check whether there are obvious mistakes. After that one could use other Unscrambler features to examine the statistical properties of the objects and variables, but in this case we go directly to PCA, as this gives a very fine overview of the data and will often show outliers immediately. Perform the PCA with leverage correction and with centering. Examine the four standard plots (score plot, loading plot, influence plot and explained variance plot).

a) How many principal components would you need, and what does the first PC (PC1) describe?

b) What percentage of the variation is described by the first two PCs?

c) Can you find an outlier? If so, do you have an idea why this outlier came about (loadings plot or scores plot)? In R: do you see a problem in the influence plot? If there is an outlier, in which other plot can you see the problem? If you see severe outliers, remove them from the data and run the PCA again (and answer a) and b) again).

d) Does a standardization (autoscaling) give a better model? (answer a) and b) again)

e) How many PCs are needed to explain 70%, 75% and 90% of the variation in the data?

f) How many PCs can you maximally get in this dataset?

g) Compare the score plot and the loading plot, and make a biplot. Do any of the variables "tell the same story"?

h) Are any variables more discriminative than the others? Are any variables dispensable?

i) Can you see the presupposed classes? Any class overlap?

j) Does the original hypothesis seem to be OK?

Exercise 2 Wine Data (To be presented by Team 1 next time)

The second dataset is called VIN2:

• Forina, M., Armanino, C., Castino, M. and Ubigli, M. (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25: 189-201.

• Forina, M., Lanteri, S., Armanino, C., Casolino, C. and Casale, M. (2010). V-PARVUS. An extendable package of programs for data exploration, classification, and correlation. (www.parvus.unige.it)

The data set VIN2.csv is an Excel CSV file. It contains 178 objects (Italian wines): the first 59 are Barolo wines (B1-B59), the next 71 are Grignolino wines (G60-G130) and the last 48 are Barbera wines (S131-S178). These wines have been characterized by 13 variables (chemical and physical measurements):

1. Alcohol (in %)

2. Malic acid

3. Ash

4. Alkalinity of Ash

5. Magnesium

6. Total phenols

7. Flavanoids

8. Nonflavanoid phenols

9. Proanthocyanins

10. Colour intensity

11. Colour hue

12. OD280 / OD315 of diluted wines

13. Proline (amino acid)

A version of the wine data is already available within R, and the JCF version can be read from the CSV file:

# Wines data:

# From the JCF uploaded file:

# Also slightly different from the version in the package

JCFwines=read.table("VIN2.csv",header=T,sep=";",dec=",")

# The wines data from the package:

# The wine class information is here stored in the wine.classes object

data(wines, package = "ChemometricsWithRData")

head(wines)

alcohol malic acid ash ash alkalinity magnesium tot. phenols

[1,] 13.20 1.78 2.14 11.2 100 2.65

[2,] 13.16 2.36 2.67 18.6 101 2.80

[3,] 14.37 1.95 2.50 16.8 113 3.85

[4,] 13.24 2.59 2.87 21.0 118 2.80

[5,] 14.20 1.76 2.45 15.2 112 3.27

[6,] 14.39 1.87 2.45 14.6 96 2.50

flavonoids non-flav. phenols proanth col. int. col. hue OD ratio

[1,] 2.76 0.26 1.28 4.38 1.05 3.40

[2,] 3.24 0.30 2.81 5.68 1.03 3.17

[3,] 3.49 0.24 2.18 7.80 0.86 3.45

[4,] 2.69 0.39 1.82 4.32 1.04 2.93

[5,] 3.39 0.34 1.97 6.75 1.05 2.85

[6,] 2.52 0.30 1.98 5.25 1.02 3.58

proline

[1,] 1050

[2,] 1185

[3,] 1480

[4,] 735

[5,] 1450

[6,] 1290

head(JCFwines)

X Wine F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12

1 S1 Barolo 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92

2 S2 Barolo 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40

3 S3 Barolo 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17

4 S4 Barolo 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45

5 S5 Barolo 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93

6 S6 Barolo 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85

1 1070

2 1050

3 1190

4 1480

6 1450

summary(wines)

alcohol malic acid ash ash alkalinity

Min. :11.0 Min. :0.74 Min. :1.36 Min. :10.6

1st Qu.:12.4 1st Qu.:1.60 1st Qu.:2.21 1st Qu.:17.2

Median :13.1 Median :1.87 Median :2.36 Median :19.5

Mean :13.0 Mean :2.34 Mean :2.37 Mean :19.5

Max. :14.8 Max. :5.80 Max. :3.23 Max. :30.0

magnesium tot. phenols flavonoids non-flav. phenols

Min. : 70.0 Min. :0.98 Min. :0.34 Min. :0.130

1st Qu.: 88.0 1st Qu.:1.74 1st Qu.:1.20 1st Qu.:0.270

Median : 98.0 Median :2.35 Median :2.13 Median :0.340

Mean : 99.6 Mean :2.29 Mean :2.02 Mean :0.362

Max. :162.0 Max. :3.88 Max. :5.08 Max. :0.660

proanth col. int. col. hue OD ratio

Min. :0.41 Min. : 1.28 Min. :0.480 Min. :1.27

1st Qu.:1.25 1st Qu.: 3.21 1st Qu.:0.780 1st Qu.:1.93

Median :1.55 Median : 4.68 Median :0.960 Median :2.78

Mean :1.59 Mean : 5.05 Mean :0.957 Mean :2.60

3rd Qu.:1.95 3rd Qu.: 6.20 3rd Qu.:1.120 3rd Qu.:3.17

Max. :3.58 Max. :13.00 Max. :1.710 Max. :4.00

proline

Min. : 278

1st Qu.: 500

Median : 672

Mean : 745

3rd Qu.: 985

Max. :1680

summary(JCFwines)

X Wine F1 F2 F3

S1 : 1 Barbera:48 Min. : 3.67 Min. :0.74 Min. :1.36

S10 : 1 Barolo :59 1st Qu.:12.35 1st Qu.:1.60 1st Qu.:2.21

S100 : 1 Grigno :71 Median :13.05 Median :1.86 Median :2.36

S101 : 1 Mean :12.94 Mean :2.34 Mean :2.37

S102 : 1 3rd Qu.:13.67 3rd Qu.:3.08 3rd Qu.:2.56

S103 : 1 Max. :14.83 Max. :5.80 Max. :3.23

(Other):172

F4 F5 F6 F7

Min. :10.6 Min. : 70.0 Min. :0.98 Min. :0.34

1st Qu.:17.2 1st Qu.: 88.0 1st Qu.:1.74 1st Qu.:1.21

Median :19.5 Median : 98.0 Median :2.35 Median :2.13

Mean :19.5 Mean : 99.7 Mean :2.30 Mean :2.03

Max. :30.0 Max. :162.0 Max. :3.88 Max. :5.08

F8 F9 F10 F11

Min. :0.130 Min. :0.41 Min. : 1.28 Min. :0.480

1st Qu.:0.270 1st Qu.:1.25 1st Qu.: 3.22 1st Qu.:0.782

Median :0.340 Median :1.55 Median : 4.69 Median :0.965

Mean :0.362 Mean :1.59 Mean : 5.06 Mean :0.957

3rd Qu.:0.438 3rd Qu.:1.95 3rd Qu.: 6.20 3rd Qu.:1.120

Max. :0.660 Max. :3.58 Max. :13.00 Max. :1.710

F12 F13

Min. :0.56 Min. : 278

1st Qu.:1.92 1st Qu.: 500

Median :2.78 Median : 674

Mean :2.59 Mean : 753

3rd Qu.:3.17 3rd Qu.: 989

Max. :4.00 Max. :1940

a) Examine the raw data. Are there any severe outliers you can detect? What do you think happened with the outlier, if any?

b) Correct wrong data, if any (in the Excel file), and run the PCA again. Do the score and loading plots look significantly different now?

c) Try PCA without standardization: which variables are important here, and why?

d) Try PCA with standardization. Which variables are important here, and would you recommend removing any of them from the data set? Which variables are especially important for the Barbera wines?

e) Suppose that alcohol % and proanthocyanins were especially healthy. Which wine would you recommend?

f) Use some re-sampling/jackknifing methods to test the significance of the variables. Are all the variables stable?
