Post on 21-Jan-2019
eNote 2
Principal Component Analysis, PCA, in R
Contents
2 Principal Component Analysis, PCA, in R
2.1 Reading about PCA
2.2 Example: Fisher's Iris Data
2.2.1 Data import
2.2.2 Basic explorative analysis
2.2.3 PCA of Iris data
2.3 Spectral data example: yarn data
2.4 Exercises
2.1 Reading about PCA
You can use the Wehrens book, Chapter 4, pp 43-56:
http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/page/1
and/or (probably better) the Varmuza book, chapter 3, sections 3.1 - 3.7:
http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2
The two R packages chemometrics and ChemometricsWithR are companions to the two books.
Bro and Smilde (2014): Principal Component Analysis. Analytical Methods, TUTORIAL REVIEW, 6, 2812. http://pubs.rsc.org/en/content/articlepdf/2014/ay/c3ay41907j
Below, a number of important plots are exemplified as part of the iris example:
1. Variance plots ("scree-type" plots)
2. Scores, loadings and biplots (main plots for interpretation of structure)
3. Explained variances for each variable
4. Validation/diagnostics plots:
(a) Leverage and residuals (also called "score distances" and "orthogonal distances"; cf. the nice Figure 3.15, page 79 in the Varmuza book)
(b) The "influence plot": residuals versus leverage
5. Jackknifing/bootstrapping/cross-validating the PCA for various purposes:
(a) Deciding on number of components
(b) Sensitivity/uncertainty investigation of scores and loadings.
What is PCA: Developed by Karl Pearson in 1901:
Pearson, K. (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine (6) 2: 559-572.
May also be called:
• Singular value decomposition
• Karhunen-Loeve expansion
• Eigenvector analysis
• Latent vector analysis
• Characteristic vector analysis
PCA is used for many things:
• Projection method
• Exploratory data analysis
• Extract information and remove noise
• Reduce dimensionality / Compression
• (Clustering)
And can be described/expressed in many ways:
• Produces optimal low-dimensional plots of observations (scores)
• Provides an overview of the variable correlation structure (loadings)
• Finds linear combinations of maximal variance
• Orthogonal distance regression method
• A bilinear model for the data
As a bilinear model, with X the (centered and scaled) n × p data matrix:
X = Observation Scores × Variable Loadings + Error
$$X = TP^T + E$$
$$X_{ij} = \sum_{a=1}^{A} t_{ia} p_{aj} + e_{ij}$$
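The bilinear model can be illustrated with a few lines of base R. This is only a sketch and not part of the original note: it assumes the built-in iris data, obtains scores and loadings from svd(), and the names X, T_A, P_A and E are chosen here for illustration.

```r
# Bilinear model X = T P^T + E with A components, via the SVD (base R only)
X <- scale(iris[, 1:4])                 # centered and scaled 150 x 4 matrix
s <- svd(X)                             # X = U D V^T

A <- 2                                  # number of components kept
T_A <- s$u[, 1:A, drop = FALSE] %*% diag(s$d[1:A], nrow = A)  # scores (150 x A)
P_A <- s$v[, 1:A, drop = FALSE]         # loadings (4 x A)
E <- X - T_A %*% t(P_A)                 # residual matrix

# The residual sum of squares equals the sum of the discarded
# squared singular values
rss <- sum(E^2)
discarded <- sum(s$d[(A + 1):4]^2)
print(c(rss = rss, discarded = discarded))
```

Changing A between 1 and 3 shows how the residual E shrinks as more components are included.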
Computations/A bit of math:
• PCA finds the linear combinations of the X-variables with maximal variance:
$$\max_{\|\alpha\|=1} \operatorname{Var}(X\alpha)$$
• PCA is the least squares fit of the bilinear (non-linear regression) model:
$$\min_{t,p} \sum_{ij} \Big(x_{ij} - \sum_{a=1}^{A} t_{ia} p_{aj}\Big)^2$$
• PCA is the eigen decomposition of $X^T X$
• PCA is the eigen decomposition of $X X^T$
• PCA is the outcome of (a version of) the NIPALS algorithm
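To make the last points concrete, here is a minimal sketch of one common textbook version of NIPALS in base R, compared with the eigen decomposition of $X^T X$. The function name nipals_pca and its arguments are hypothetical and written for this illustration; they do not come from any of the packages mentioned in this note.

```r
# NIPALS sketch for the first A principal components (base R only).
# Function name and arguments are illustrative, not from any package.
nipals_pca <- function(X, A = 2, tol = 1e-10, max_iter = 500) {
  X <- as.matrix(X)
  scores <- matrix(0, nrow(X), A)
  loadings <- matrix(0, ncol(X), A)
  for (a in seq_len(A)) {
    t_a <- X[, which.max(apply(X, 2, var))]   # start: column with largest variance
    for (iter in seq_len(max_iter)) {
      p_a <- crossprod(X, t_a) / sum(t_a^2)   # regress the columns of X on t
      p_a <- p_a / sqrt(sum(p_a^2))           # normalise the loading to length 1
      t_new <- X %*% p_a                      # regress the rows of X on p
      converged <- sum((t_new - t_a)^2) < tol * sum(t_new^2)
      t_a <- t_new
      if (converged) break
    }
    scores[, a] <- t_a
    loadings[, a] <- p_a
    X <- X - t_a %*% t(p_a)                   # deflate: remove the fitted component
  }
  list(scores = scores, loadings = loadings)
}

X <- scale(iris[, 1:4])
fit <- nipals_pca(X, A = 2)

# Compare with the eigen decomposition of X^T X (signs are arbitrary,
# so absolute values are compared)
ev <- eigen(crossprod(X))
max_diff <- max(abs(abs(fit$loadings) - abs(ev$vectors[, 1:2])))
print(max_diff)
```

NIPALS extracts one component at a time, so it can stop after the first few components, which is convenient when there are many variables.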
2.2 Example: Fisher’s Iris Data
Below there will be an exercise based on these data with some questions that PCA can be helpful in answering. Here we exemplify a number of visualizations that one could do for such data, including PCA-based ones.
The Fisher Iris data-set is classic, c.f.:
• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.
• Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5.
There are 150 objects: 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measured with a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW), only four variables in all.
The original hypothesis was that I. versicolor was a hybrid of the two other species, i.e. I. setosa x virginica. I. setosa is diploid, I. virginica is tetraploid, and I. versicolor is hexaploid.
2.2.1 Data import
The iris data can already be found within R, so no import is needed:
# Loading the packages related to the Wehrens book
# (First time you need to install the packages)
library(ChemometricsWithRData)
library(ChemometricsWithR)
data(iris)
Or read the Iris csv data, which is a copy of the file uploaded on CampusNet. First save the data set on your computer and set the relevant working directory in R, e.g. by clicking 'Session' and choosing 'Set working directory', or run the following command with the correct folder path:
setwd("C:/myfolderpath")
And then import the data into R as follows:
JCFiris <- read.table("Fisher_JCF.csv", header = TRUE, sep = ";", dec = ",")
Note that the Iris data given by JCF (the CampusNet file) is slightly different from the iris data available in R:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
Median :5.80 Median :3.00 Median :4.35 Median :1.3
Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
Species
setosa :50
versicolor:50
virginica :50
summary(JCFiris)
X PW PL SW
setosa :50 Min. : 1.0 Min. :10.0 Min. :20.0
versicolor:50 1st Qu.: 3.0 1st Qu.:16.0 1st Qu.:28.0
virginica :50 Median :13.0 Median :44.0 Median :30.0
Mean :11.9 Mean :37.8 Mean :30.6
3rd Qu.:18.0 3rd Qu.:51.0 3rd Qu.:33.0
Max. :25.0 Max. :69.0 Max. :44.0
SL
Min. : 43.0
1st Qu.: 51.0
Median : 58.0
Mean : 62.6
3rd Qu.: 64.0
Max. :699.0
Note the differences: the names, order and scales. AND: an outlier in the JCF version has been changed in the R version. Look at the first 6 observations:
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
head(JCFiris)
X PW PL SW SL
1 setosa 2 14 33 50
2 virginica 24 56 31 67
3 virginica 23 51 31 69
4 setosa 2 10 36 46
5 virginica 20 52 30 65
6 virginica 19 51 27 58
The dimensions are the same:
dim(iris)
[1] 150 5
dim(JCFiris)
[1] 150 5
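To compare the two versions directly, the JCF layout can be brought into the layout of R's iris data. The sketch below is not from the original note: it types in the six rows printed by head(JCFiris) so that it runs without the csv file, and the division by 10 is an assumption inferred from comparing the two summaries (e.g. minimum sepal length 43 versus 4.30). The object names JCF_head and JCF_like_iris are made up for the illustration.

```r
# The first six rows of the JCF version, typed in from the head() output
JCF_head <- data.frame(
  X  = c("setosa", "virginica", "virginica", "setosa", "virginica", "virginica"),
  PW = c(2, 24, 23, 2, 20, 19),
  PL = c(14, 56, 51, 10, 52, 51),
  SW = c(33, 31, 31, 36, 30, 27),
  SL = c(50, 67, 69, 46, 65, 58)
)

# Assumption: the JCF values are 10 times the values in R's iris data
# Rename, reorder and rescale to match the columns of iris
JCF_like_iris <- data.frame(
  Sepal.Length = JCF_head$SL / 10,
  Sepal.Width  = JCF_head$SW / 10,
  Petal.Length = JCF_head$PL / 10,
  Petal.Width  = JCF_head$PW / 10,
  Species      = factor(JCF_head$X, levels = levels(iris$Species))
)
print(JCF_like_iris)
```

With the full JCFiris data frame the same steps apply; the rows are in a different order than in iris, so a row-by-row comparison would also need sorting.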
2.2.2 Basic explorative analysis
First we do some classic (univariate) explorative analysis:
# 4 boxplots with color:
par(mar=c(4,2,3,2),mfrow=c(2,2))
for (i in 1:4) boxplot(iris[,i] ~ iris[,5],
col = 1:3, main = names(iris)[i])
[Figure: four boxplot panels, Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, each split by species (setosa, versicolor, virginica)]
The par(mar=c(4,2,3,2)) command controls the four margins of each individual plot in the order: bottom, left, top, right. This is helpful to make nice multi-plot pages.
# Pairwise scatters:
pairs(iris,col = iris$Species)
[Figure: pairwise scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, coloured by species]
Let us, for the record, have a look at the covariance matrix:
cov(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         0.69       -0.04         1.27        0.52
Sepal.Width         -0.04        0.19        -0.33       -0.12
Petal.Length         1.27       -0.33         3.12        1.30
Petal.Width          0.52       -0.12         1.30        0.58
And similarly the correlation matrix:
cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00
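The two matrices are closely related, which matters for the choice between PCA on covariances and PCA on correlations. A small sketch (an addition to the note, using only base R functions) shows that the correlation matrix is the covariance matrix rescaled by the standard deviations, and equals the covariance matrix of the standardized variables:

```r
X <- iris[, 1:4]
S <- cov(X)
R <- cor(X)

# Correlation = covariance rescaled by the standard deviations
R_from_S <- cov2cor(S)
# The covariance of the standardized variables is again the correlation matrix
S_of_scaled <- cov(scale(X))

print(all.equal(R, R_from_S))
print(all.equal(R, S_of_scaled))
print(round(R, 2))
```

So a PCA of standardized data works on the correlation structure, whereas a PCA of data that are only centered is dominated by the variables with the largest variances.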
2.2.3 PCA of Iris data
First we do a basic PCA on covariances (WITHOUT standardization, ONLY with centering), here using the PCA function of the ChemometricsWithR package:
irisPC_without=PCA(scale(iris[,1:4], scale = FALSE))
Note that the scale-function is used here to just center the four variables.
# A good selection of 4 core plots:
par(mar=c(4,2,3,2),mfrow=c(2,2))
scoreplot(irisPC_without, col = iris$Species, main = "Scores")
loadingplot(irisPC_without, show.names = TRUE, main = "Loadings")
biplot(irisPC_without, score.col = iris$Species, main = "biplot")
screeplot(irisPC_without, type = "percentage", main = "Explained variance")
[Figure: four panels for the PCA on covariances: Scores, Loadings and biplot, each with axes PC 1 (92.5%) and PC 2 (5.3%), and Explained variance (% variance versus # PCs)]
And now the PCA on correlations (WITH Standardization - AND with centering):
irisPC <- PCA(scale(iris[,1:4]))
Note that the scale-function is now used to both center and standardize the four variables, which is the default choice of this function.
par(mar=c(4,2,3,2),mfrow=c(2,2))
scoreplot(irisPC, col = iris$Species, main = "Scores")
loadingplot(irisPC, show.names = TRUE, main = "Loadings")
biplot(irisPC, score.col = iris$Species, main = "biplot")
screeplot(irisPC, type = "percentage", main = "Explained variance")
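The explained variances behind the two analyses can also be computed directly from the eigenvalues. The following sketch is an addition that uses base R's prcomp() instead of the PCA function of the ChemometricsWithR package, so it runs without extra packages; the object names are chosen here for illustration.

```r
# Explained variance from the eigenvalues, using base R's prcomp()
pc_cov <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)  # covariances
pc_cor <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)   # correlations

# Squared standard deviations of the components are the eigenvalues
pct_cov <- 100 * pc_cov$sdev^2 / sum(pc_cov$sdev^2)
pct_cor <- 100 * pc_cor$sdev^2 / sum(pc_cor$sdev^2)

print(round(rbind(covariance = pct_cov, correlation = pct_cor), 1))
```

The percentages for the first two components should agree with those printed on the axis labels of the corresponding score and loading plots.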
[Figure: four panels for the PCA on correlations: Scores, Loadings and biplot, each with axes PC 1 (73.0%) and PC 2 (22.9%), and Explained variance (% variance versus # PCs)]
There can be other versions of the variance plot, e.g.:
par(mfrow=c(1,2))
plot(1:length(irisPC$var), irisPC$var, cex = 2,
ylab = "variance explained",xlab = "n PC")
lines(1:length(irisPC$var), irisPC$var)
plot(1:length(irisPC$var), irisPC$var/sum(irisPC$var), cex = 2,
ylab = "(explained variance)/(total variance)",xlab = "n PC")
lines(1:length(irisPC$var), irisPC$var/sum(irisPC$var))
[Figure: two panels: variance explained versus n PC, and (explained variance)/(total variance) versus n PC]
It can be useful to plot more components than just the first two:
# Scores:
pairs(scores(irisPC), col = iris$Species)
[Figure: pairwise scatter plot matrix of the scores on PC 1, PC 2, PC 3 and PC 4, coloured by species]
# Loadings:
par(mfrow = c(4,4), mar = c(4,4,.1,.1))
for (i in 1:4) for (j in 1:4) loadingplot(irisPC,
show.names = TRUE,pc=c(i,j), cex.lab=0.7)
[Figure: 4 x 4 grid of loading plots for all pairs of PC 1 (73.0%), PC 2 (22.9%), PC 3 (3.7%) and PC 4 (0.5%), with the variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width labelled]
A much nicer biplot can be created with the ggbiplot package (now using the prcomp function to do the PCA):
ir.pca <- prcomp(iris[,1:4],
center = TRUE,
scale. = TRUE)
library(devtools)
# First time install: install_github("vqv/ggbiplot")
library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1,
groups = iris[,5], ellipse = TRUE,
circle = FALSE)
print(g)
[Figure: ggbiplot of the iris PCA with axes PC1 (73.0% explained var.) and PC2 (22.9% explained var.), loading arrows for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, and ellipses for the groups setosa, versicolor and virginica]
Generally about interpreting PCA plots:
• Look at variances (scree): hope for few (2), look for the "bend"
• Look at scores and loadings (e.g. biplot)
– Scores: OBSERVATION mapping
∗ preserves inter-observation distances (as well as possible)
– Loadings: VARIABLE mapping (correlation structure)
∗ Variables in the SAME DIRECTION from (0,0) AND far away from (0,0) are highly correlated
– Loadings tell us on which variables the observations differ
– An observation to the right has high values on the variables with (large) loadings to the right
– An observation to the left has low values on the variables with (large) loadings to the right
• Look at residuals (orthogonal distances) and leverages (score distances) (outliers etc.)
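The rule about variables pointing in the same direction can be checked numerically. The sketch below is an illustration added here, assuming base R's prcomp() and the two petal variables as the example; the helper cos_angle is a made-up name. It compares the angle between two loading vectors in the PC1-PC2 plane with the ordinary correlation of the variables.

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)
L <- pc$rotation[, 1:2]            # loadings on PC1 and PC2

# Cosine of the angle between two loading vectors in the PC1-PC2 plane
cos_angle <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cos_petals <- cos_angle(L["Petal.Length", ], L["Petal.Width", ])
cor_petals <- cor(iris$Petal.Length, iris$Petal.Width)
print(c(cosine = cos_petals, correlation = cor_petals))
```

A cosine close to 1 means the two variables point in nearly the same direction in the loading plot. The match with the correlation is only approximate, and it is most trustworthy when the two components shown explain most of the variance of both variables.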
Finally, let us show some of the diagnostics (residuals) plotting. For this we will use the chemometrics package (and now the princomp function for the PCA):
library(chemometrics)
irisPCA <- princomp(iris[,1:4], cor = TRUE)
# The score distances res$SDist express the leverage values
# The orthogonal distances express the residuals
## Plots vs object number :
res <- pcaDiagplot(iris[,1:4], irisPCA, a = 2)
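To see what the two distances measure, they can be computed by hand. This is a sketch under stated assumptions, not the code of pcaDiagplot: it uses base R's prcomp() on standardized data, takes the score distance as the distance from the centre within the model plane with each score divided by the standard deviation of its component, and takes the orthogonal distance as the length of the residual vector. Because princomp() and prcomp() use different divisors (n versus n - 1), the numbers may differ slightly from those returned by pcaDiagplot. The function name sd_od is hypothetical.

```r
# Score and orthogonal distances computed by hand (base R only)
X <- scale(iris[, 1:4])
pc <- prcomp(X)

sd_od <- function(pc, X, a) {
  scores <- pc$x[, 1:a, drop = FALSE]
  loadings <- pc$rotation[, 1:a, drop = FALSE]
  # Score distance: distance from the centre within the model plane,
  # each score scaled by the standard deviation of its component
  SD <- sqrt(rowSums(sweep(scores^2, 2, pc$sdev[1:a]^2, "/")))
  # Orthogonal distance: length of the residual after projecting on the plane
  OD <- sqrt(rowSums((X - scores %*% t(loadings))^2))
  list(SD = SD, OD = OD)
}

d2 <- sd_od(pc, X, a = 2)
d4 <- sd_od(pc, X, a = 4)
print(head(cbind(SD = d2$SD, OD = d2$OD)))
print(max(d4$OD))   # all four components kept: residuals vanish up to rounding
```

With all four components in the model nothing is left for the residuals, so the orthogonal distances are zero apart from rounding error.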
[Figure: two panels: Score distance SD versus Object number, and Orthogonal distance OD versus Object number]
## Plot of the two against each other:
par(mfrow=c(1,2))
plot(res$SDist, res$ODist, type = "n")
text(res$SDist, res$ODist, labels = row.names(iris))
## Explained variance for each variable
pcaVarexpl(iris[,1:4],a=2)
[Figure: two panels: res$ODist versus res$SDist with observation numbers 1-150 as labels, and a bar plot of the explained variance for each variable]
# Influence plot: residuals versus leverage
# for different number of components:
par(mfrow=c(2,2))
for (i in 1:4) {res=pcaDiagplot(iris[,1:4],a=i,irisPCA,plot=FALSE)
plot(res$SDist,res$ODist,type="n")
text(res$SDist,res$ODist,labels=row.names(iris))
}
[Figure: four influence plots, res$ODist versus res$SDist with observation numbers as labels, for 1, 2, 3 and 4 components; with 4 components the orthogonal distances are of the order 1e-15]
Finally, let us indicate how one could do some 're-sampling' (similar to "jackknifing"): leave out a certain number of the observations and plot the loadings and/or scores for each data subset. First the loadings:
# Random samples of a certain proportion of the
# original number of observations are left out
par(mar = c(1,1,1,1), mfrow = c(3,3))
n=length(iris[,1])
leave_out_size=0.50
for (k in 1:9){irisPC=PCA(scale(iris[sample(1:n,round(n*(1-leave_out_size))),1:4]))
loadingplot(irisPC, show.names = TRUE, main = "Loadings")
}
[Figure: nine loading plots (PC 1 versus PC 2), one per random subsample, each labelling Sepal.Length, Sepal.Width, Petal.Length and Petal.Width; PC 1 explains about 72-74% and PC 2 about 21-24% in the panels where the percentages are legible]
Then the scores:
par(mar = c(1,1,1,1), mfrow = c(3,3))
for (k in 1:9){subsample <- sample(1:n,round(n*(1-leave_out_size)))
irisPC <- PCA(scale(iris[subsample,1:4]))
scoreplot(irisPC, col = iris$Species[subsample], main = "Scores")
}
[Figure: nine score plots (PC 1 versus PC 2), one per random subsample, coloured by species; PC 1 explains about 69-75% and PC 2 about 20-26% in the panels where the percentages are legible]
The choice of showing nine subsamples is arbitrary, and other plots of this re-sampling type are possible.
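As an alternative to plots, the re-sampling can be summarised numerically. The following is a minimal sketch, assuming base R's prcomp() in place of the PCA() function used in this note; the values of leave_out_size and n_rep are set here only for illustration and are not taken from the original analysis.

```r
# Sketch: spread of explained variance across random subsamples of iris.
# Uses stats::prcomp() rather than ChemometricsWithR::PCA().
set.seed(1)                      # only to make the sketch reproducible
n <- nrow(iris)
leave_out_size <- 0.5            # assumed fraction left out of each subsample
n_rep <- 100                     # hypothetical number of subsamples

explained <- t(replicate(n_rep, {
  subsample <- sample(1:n, round(n * (1 - leave_out_size)))
  pc <- prcomp(iris[subsample, 1:4], center = TRUE, scale. = TRUE)
  100 * pc$sdev^2 / sum(pc$sdev^2)   # percent variance per component
}))
colnames(explained) <- paste0("PC", 1:4)

# Minimum, mean and maximum explained variance per component
resample_summary <- apply(explained, 2, function(x)
  c(min = min(x), mean = mean(x), max = max(x)))
round(resample_summary, 1)
```

A narrow range for a component suggests that its share of the variance does not depend much on which flowers happen to be in the subsample.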
2.3 Spectral data example: yarn data
## Spectral data
data(yarn, package = "pls") # The yarn data ship with the pls package
# Try: ?yarn
dim(yarn$NIR)
## [1] 28 268
par(mfrow = c(2, 2), mar = c(4, 4, .2, .2))
# Plot the first 21 of the 28 NIR spectra - raw
X <- yarn$NIR
plot(X[1, ], type = "n", ylim = range(X))
for (i in 1:21) lines(X[i, ], col = i)
# The same 21 spectra - column-centered
Xc <- scale(X, scale = FALSE)
plot(Xc[1, ], type = "n", ylim = range(Xc))
for (i in 1:21) lines(Xc[i, ], col = i)
# The same 21 spectra - column-centered and scaled
Xs <- scale(X)
plot(Xs[1, ], type = "n", ylim = range(Xs))
for (i in 1:21) lines(Xs[i, ], col = i)
# Plot of the principal variances
yarnPC <- PCA(Xs)
plot(1:length(yarnPC$var), yarnPC$var, cex = 2)
lines(1:length(yarnPC$var), yarnPC$var)
[Figure: 2 x 2 panels for the yarn data: raw NIR spectra, centered spectra, and centered and scaled spectra (each against wavelength index), and the principal variances yarnPC$var against component number.]
# Plot of y:
plot(yarn$density,type="n")
lines(yarn$density)
[Figure: yarn$density plotted against sample index.]
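The principal variances plotted above for the yarn spectra can also be tabulated as percentages. The following is a minimal sketch, assuming base R's prcomp() instead of PCA(); explained_variance() is a hypothetical helper name, and the yarn part only runs if the pls package is installed.

```r
# Sketch: percentage of variance explained per component, via stats::prcomp().
# explained_variance() is a hypothetical helper, not part of any package.
explained_variance <- function(X, scale. = TRUE) {
  pc <- prcomp(X, center = TRUE, scale. = scale.)
  pct <- 100 * pc$sdev^2 / sum(pc$sdev^2)
  data.frame(PC = seq_along(pct), percent = pct, cumulative = cumsum(pct))
}

# Applied to the yarn spectra, assuming the pls package is available
if (requireNamespace("pls", quietly = TRUE)) {
  data(yarn, package = "pls")
  print(head(explained_variance(yarn$NIR, scale. = TRUE), 5))    # autoscaled
  print(head(explained_variance(yarn$NIR, scale. = FALSE), 5))   # centered only
}
```

Printing both versions side by side makes it easy to see how much the choice between centering only and autoscaling changes the first few components for spectral data.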
2.4 Exercises
Exercise 1 Fisher’s Iris data
First examine the raw data for obvious mistakes. After that one could use other Unscrambler features to examine the statistical properties of the objects and variables, but in this case we go directly to PCA, as this gives a very good overview of the data and will often reveal outliers immediately. Perform the PCA with leverage correction and with centering. Examine the four standard plots (score plot, loading plot, influence plot and explained variance plot).
a) How many principal components would you need, and what does the first PC (PC1) describe?
b) What percentage of the variation is described by the first two PCs?
c) Can you find an outlier? If so, do you have an idea why this outlier came about (loadings plot or scores plot)? In R: Do you see a problem in the influence plot? If there is an outlier, in which other plot can you see the problem? If you see severe outliers, remove them from the data and run PCA again (and answer a) and b) again).
d) Does standardization (autoscaling) give a better model? (Answer a) and b) again.)
e) How many PCs are needed to explain 70%, 75% and 90% of the variation in the data?
f) How many PCs can you maximally get in this dataset?
g) Compare the score and the loading plot, and make a biplot. Do any of the variables "tell the same story"?
h) Are any variables more discriminative than the others? Are any variables dispensable?
i) Can you see the presupposed classes? Any class overlap?
j) Does the original hypothesis seem to be OK?
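Questions b), d) and e) all come down to reading off the cumulative explained variance. The following is a minimal sketch of how that could be automated, assuming base R's prcomp() rather than the PCA() function used in this note; n_components_for() is a hypothetical helper name.

```r
# Sketch: smallest number of PCs whose cumulative explained variance
# reaches a given threshold (in percent).
# n_components_for() is a hypothetical helper built on stats::prcomp().
n_components_for <- function(X, threshold, scale. = FALSE) {
  pc <- prcomp(X, center = TRUE, scale. = scale.)
  cum_pct <- 100 * cumsum(pc$sdev^2) / sum(pc$sdev^2)
  which(cum_pct >= threshold)[1]
}

X <- iris[, 1:4]
# Compare centering only with autoscaling at the thresholds from question e)
sapply(c(70, 75, 90), function(th) {
  c(threshold  = th,
    centered   = n_components_for(X, th, scale. = FALSE),
    autoscaled = n_components_for(X, th, scale. = TRUE))
})
```

The same helper can be re-run after removing suspected outliers to see whether the required number of components changes.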
Exercise 2 Wine Data (To be presented by Team 1 next time)
The second dataset is called VIN2:
• Forina, M., Armanino, C., Castino, M. and Ubigli, M. (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25: 189-201.
• Forina, M., Lanteri, S., Armanino, C., Casolino, C. and Casale, M. (2010). V-PARVUS. An extendable package of programs for data exploration, classification, and correlation. (www.parvus.unige.it)
The dataset VIN2.csv is an Excel CSV file. In this dataset there are 178 objects (Italian wines): the first 59 are Barolo wines (B1-B59), the next 71 are Grignolino wines (G60-G130) and the last 48 are Barbera wines (S131-S178). These wines have been characterized by 13 variables (chemical and physical measurements):
1. Alcohol (in %)
2. Malic acid
3. Ash
4. Alkalinity of Ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Colour intensity
11. Colour hue
12. OD280 / OD315 of diluted wines
13. Proline (amino acid)
The wine data can already be found within R, so no import is needed for the package version:
# Wines data:
# From the JCF uploaded file
# (slightly different from the version in the package)
JCFwines <- read.table("VIN2.csv", header = TRUE, sep = ";", dec = ",",
                       stringsAsFactors = TRUE) # factors as in the summary below (needed in R >= 4.0)
# The wines data from the package:
# The wine class information is here stored in the wine.classes object
data(wines, package = "ChemometricsWithRData")
head(wines)
     alcohol malic acid  ash ash alkalinity magnesium tot. phenols
[1,]   13.20       1.78 2.14           11.2       100         2.65
[2,]   13.16       2.36 2.67           18.6       101         2.80
[3,]   14.37       1.95 2.50           16.8       113         3.85
[4,]   13.24       2.59 2.87           21.0       118         2.80
[5,]   14.20       1.76 2.45           15.2       112         3.27
[6,]   14.39       1.87 2.45           14.6        96         2.50
     flavonoids non-flav. phenols proanth col. int. col. hue OD ratio
[1,]       2.76              0.26    1.28      4.38     1.05     3.40
[2,]       3.24              0.30    2.81      5.68     1.03     3.17
[3,]       3.49              0.24    2.18      7.80     0.86     3.45
[4,]       2.69              0.39    1.82      4.32     1.04     2.93
[5,]       3.39              0.34    1.97      6.75     1.05     2.85
[6,]       2.52              0.30    1.98      5.25     1.02     3.58
     proline
[1,]    1050
[2,]    1185
[3,]    1480
[4,]     735
[5,]    1450
[6,]    1290
head(JCFwines)
   X   Wine    F1   F2   F3   F4  F5   F6   F7   F8   F9  F10  F11  F12
1 S1 Barolo 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92
2 S2 Barolo 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40
3 S3 Barolo 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17
4 S4 Barolo 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45
5 S5 Barolo 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93
6 S6 Barolo 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85
   F13
1 1070
2 1050
3 1190
4 1480
5  735
6 1450
summary(wines)
    alcohol       malic acid        ash        ash alkalinity
 Min.   :11.0   Min.   :0.74   Min.   :1.36   Min.   :10.6
 1st Qu.:12.4   1st Qu.:1.60   1st Qu.:2.21   1st Qu.:17.2
 Median :13.1   Median :1.87   Median :2.36   Median :19.5
 Mean   :13.0   Mean   :2.34   Mean   :2.37   Mean   :19.5
 3rd Qu.:13.7   3rd Qu.:3.10   3rd Qu.:2.56   3rd Qu.:21.5
 Max.   :14.8   Max.   :5.80   Max.   :3.23   Max.   :30.0
   magnesium      tot. phenols     flavonoids   non-flav. phenols
 Min.   : 70.0   Min.   :0.98   Min.   :0.34   Min.   :0.130
 1st Qu.: 88.0   1st Qu.:1.74   1st Qu.:1.20   1st Qu.:0.270
 Median : 98.0   Median :2.35   Median :2.13   Median :0.340
 Mean   : 99.6   Mean   :2.29   Mean   :2.02   Mean   :0.362
 3rd Qu.:107.0   3rd Qu.:2.80   3rd Qu.:2.86   3rd Qu.:0.440
 Max.   :162.0   Max.   :3.88   Max.   :5.08   Max.   :0.660
    proanth        col. int.        col. hue        OD ratio
 Min.   :0.41   Min.   : 1.28   Min.   :0.480   Min.   :1.27
 1st Qu.:1.25   1st Qu.: 3.21   1st Qu.:0.780   1st Qu.:1.93
 Median :1.55   Median : 4.68   Median :0.960   Median :2.78
 Mean   :1.59   Mean   : 5.05   Mean   :0.957   Mean   :2.60
 3rd Qu.:1.95   3rd Qu.: 6.20   3rd Qu.:1.120   3rd Qu.:3.17
 Max.   :3.58   Max.   :13.00   Max.   :1.710   Max.   :4.00
    proline
 Min.   : 278
 1st Qu.: 500
 Median : 672
 Mean   : 745
 3rd Qu.: 985
 Max.   :1680
summary(JCFwines)
       X            Wine          F1              F2             F3
 S1     :  1   Barbera:48   Min.   : 3.67   Min.   :0.74   Min.   :1.36
 S10    :  1   Barolo :59   1st Qu.:12.35   1st Qu.:1.60   1st Qu.:2.21
 S100   :  1   Grigno :71   Median :13.05   Median :1.86   Median :2.36
 S101   :  1                Mean   :12.94   Mean   :2.34   Mean   :2.37
 S102   :  1                3rd Qu.:13.67   3rd Qu.:3.08   3rd Qu.:2.56
 S103   :  1                Max.   :14.83   Max.   :5.80   Max.   :3.23
 (Other):172
       F4             F5              F6             F7
 Min.   :10.6   Min.   : 70.0   Min.   :0.98   Min.   :0.34
 1st Qu.:17.2   1st Qu.: 88.0   1st Qu.:1.74   1st Qu.:1.21
 Median :19.5   Median : 98.0   Median :2.35   Median :2.13
 Mean   :19.5   Mean   : 99.7   Mean   :2.30   Mean   :2.03
 3rd Qu.:21.5   3rd Qu.:107.0   3rd Qu.:2.80   3rd Qu.:2.88
 Max.   :30.0   Max.   :162.0   Max.   :3.88   Max.   :5.08
       F8              F9            F10             F11
 Min.   :0.130   Min.   :0.41   Min.   : 1.28   Min.   :0.480
 1st Qu.:0.270   1st Qu.:1.25   1st Qu.: 3.22   1st Qu.:0.782
 Median :0.340   Median :1.55   Median : 4.69   Median :0.965
 Mean   :0.362   Mean   :1.59   Mean   : 5.06   Mean   :0.957
 3rd Qu.:0.438   3rd Qu.:1.95   3rd Qu.: 6.20   3rd Qu.:1.120
 Max.   :0.660   Max.   :3.58   Max.   :13.00   Max.   :1.710
      F12            F13
 Min.   :0.56   Min.   : 278
 1st Qu.:1.92   1st Qu.: 500
 Median :2.78   Median : 674
 Mean   :2.59   Mean   : 753
 3rd Qu.:3.17   3rd Qu.: 989
 Max.   :4.00   Max.   :1940
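Extreme values in a raw data table can also be screened programmatically rather than by eye. The following is a minimal sketch, assuming that a robust z-score (distance from the column median in units of the MAD) is an acceptable criterion; flag_outliers() and the cutoff of 6 are hypothetical choices, and the toy data frame is made up for illustration only.

```r
# Sketch: flag cells that lie far from the column median, measured in
# units of the MAD. flag_outliers() and cutoff = 6 are hypothetical choices.
flag_outliers <- function(df, cutoff = 6) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  z <- sapply(num, function(x)
    (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE))
  hits <- which(abs(z) > cutoff, arr.ind = TRUE)
  data.frame(row = hits[, "row"],
             variable = colnames(num)[hits[, "col"]],
             value = as.matrix(num)[hits],
             row.names = NULL)
}

# With the imported file this would be called as flag_outliers(JCFwines).
# Toy illustration (made-up numbers) with one planted extreme value:
toy <- data.frame(id = paste0("S", 1:8),
                  F1 = c(13.2, 13.1, 14.4, 13.2, 14.2, 1.32, 13.9, 12.9))
flag_outliers(toy)
```

A flagged cell is only a candidate: whether it is a recording error or a genuine extreme has to be judged from the data and the measurement context.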
a) Examine the raw data. Are there any severe outliers you can detect? What do you think happened with the outlier, if any?
b) Correct wrong data, if any (in the Excel file), and run PCA again. Do the score and loading plots look significantly different now?
c) Try PCA without standardization: Which variables are important here and why?
d) Try PCA with standardization. Which variables are important here, and would you recommend removing any of them from the data set? Which variables are especially important for the Barbera wines?
e) Suppose that alcohol % and proanthocyanins were especially healthy. Which wine would you recommend?
f) Use some re-sampling/jack-knifing methods to test for significance of the variables. Are all the variables stable?
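For question f), one possible approach is a leave-one-out jack-knife of the loadings. The following is a minimal sketch, assuming base R's prcomp() with autoscaling instead of the PCA() function used in this note; jackknife_loadings() is a hypothetical helper name, and iris is used only as a stand-in for the wine variables.

```r
# Sketch: leave-one-out (jack-knife) spread of the loadings of one component.
# jackknife_loadings() is a hypothetical helper built on stats::prcomp().
jackknife_loadings <- function(X, comp = 1) {
  X <- as.matrix(X)
  ref <- prcomp(X, center = TRUE, scale. = TRUE)$rotation[, comp]
  jk <- sapply(seq_len(nrow(X)), function(i) {
    l <- prcomp(X[-i, ], center = TRUE, scale. = TRUE)$rotation[, comp]
    if (sum(l * ref) < 0) -l else l   # PCA signs are arbitrary: align to ref
  })
  data.frame(variable = colnames(X),
             loading = ref,
             jk_min = apply(jk, 1, min),
             jk_max = apply(jk, 1, max),
             row.names = NULL)
}

# Illustrated on iris; for the exercise, pass the 13 wine variables instead
jackknife_loadings(iris[, 1:4])
```

The sign alignment matters because each refit may flip the direction of a component. A variable whose jack-knife range is wide or straddles zero has an unstable loading on that component.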