Implementation of ANOVA-PCA in R for Multivariate Data Exploration

Matthew J. Keinsley & Bryan A. Hanson

Dept. of Chemistry & Biochemistry, DePauw University, Greencastle, Indiana, USA

Analysis of Variance

Analysis of Variance (ANOVA) is a significance test which considers whether or not two samples come from the same population. In the diagram below, each curve represents a series of measurements on one of two samples, each described by a mean and standard deviation. A null hypothesis is defined which typically states that the two samples are from the same population. ANOVA can be thought of as measuring how far the two samples can drift apart and still come from the same population (i.e. the null hypothesis is true). The key values from an ANOVA calculation are the test statistic and the p-value. The p-value represents the probability of obtaining a test statistic at least as extreme as the one observed if the two samples are from the same population. The smaller the p-value, the more confident one is about rejecting the null hypothesis at a selected confidence level, usually 95%. Internally, the calculation compares the between-sample variation to the within-sample variation to generate the test statistic. This is referred to as "partitioning the variance."

[Figure: two distribution curves, one per sample; x-axis: sample values (0 to 15).]
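The partitioning described above can be sketched in a few lines of base R. This is only an illustration: the two samples, their means and the variable names are made up for the example and are not data from this poster. The between-sample and within-sample sums of squares are computed by hand and then compared with the result of the built-in `aov()` function.

```r
# Two simulated samples (illustrative values, not data from the poster)
set.seed(1)
values <- c(rnorm(10, mean = 5, sd = 1), rnorm(10, mean = 7, sd = 1))
group  <- factor(rep(c("s1", "s2"), each = 10))

# Partition the variance by hand
grand_mean  <- mean(values)
fitted_mean <- ave(values, group)            # each value's own group mean
ss_between  <- sum((fitted_mean - grand_mean)^2)
ss_within   <- sum((values - fitted_mean)^2)
df_between  <- nlevels(group) - 1
df_within   <- length(values) - nlevels(group)

# Test statistic: between-sample variation relative to within-sample variation
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# The same one-way ANOVA using the built-in function
tab   <- summary(aov(values ~ group))[[1]]
f_aov <- tab[["F value"]][1]
p_aov <- tab[["Pr(>F)"]][1]

cat("F =", f_manual, " p =", p_manual, "\n")
```

The hand calculation and `aov()` should agree, which makes it easy to see where the test statistic comes from.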

Principal Component Analysis

Principal Component Analysis (PCA) is a common multivariate data analysis method which eliminates uninformative variation (noise) and re-expresses the data in terms of abstract "principal components." The diagram below illustrates the matrix algebra used in the computation. The PCA results are scores and loadings. Scores represent the samples in the new data space, while loadings are the weights by which each variable is multiplied to obtain the score. While score plots are usually color-coded by group membership, PCA is blind to group membership.

[Figure: Data reduction method of PCA. The original data matrix X (objects or specimens 1 ... i ... n in rows, measured variables 1 ... j ... m in columns) is related to the scores matrix T (n rows, columns 1 ... k ... a) and the loadings matrix P (m rows, columns 1 ... k ... a) by T = X • P, or XP. The columns of T hold the PC1 scores for each object, the PC2 scores for each object, etc.; the columns of P hold the loadings for PC1, the loadings for PC2, etc.]
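The relationship T = XP can be checked numerically with base R's `prcomp()`. The sketch below uses a random matrix as a stand-in for real data (an assumption for illustration) and centers the columns without scaling them, matching the "centered/noscale" setting shown in the plot annotations on this poster. Note that `prcomp()` works on the centered matrix, so the identity holds for the centered X.

```r
# Random stand-in data: n = 20 objects (rows), m = 6 measured variables (columns)
set.seed(2)
X <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)

# Classical PCA: center each variable, do not scale
pca  <- prcomp(X, center = TRUE, scale. = FALSE)
Tmat <- pca$x          # scores, one row per object
P    <- pca$rotation   # loadings, one row per variable

# Scores are the centered data multiplied by the loadings: T = XP
Xc  <- sweep(X, 2, colMeans(X))
gap <- max(abs(Xc %*% P - Tmat))

# Fraction of the total variance carried by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Data reduction: keep only the first a components
a <- 2
X_reduced <- Tmat[, 1:a] %*% t(P[, 1:a])
```

Here `gap` should be zero to within rounding error, and `X_reduced` is the rank-`a` approximation of the centered data obtained by discarding the later components.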

ANOVA-PCA

ANOVA-PCA, developed by Harrington et al., is a combination of both methods. The data are partitioned into submatrices (shown below) corresponding to each experimental factor, in a manner reminiscent of ANOVA. The submatrices are then separately subjected to PCA after adding back the residual error. If the effect of a factor is large compared to the residual error, separation along the 1st PC in the score plot should be evident. With this method, the significance of a factor can be assessed visually. Unlike PCA, ANOVA-PCA is not blind to group membership.

Raw Data = Grand Mean + Means for each level of factor 1 + Means for each level of factor 2 + Means for each level of factor 1 x factor 2 + Residual Error

ANOVA-PCA, cont'd

The method used to create the submatrices is shown below. The rows for all samples belonging to a given level of a particular factor are replaced by their group average.

[Figure: an original data matrix, with rows marked by group membership, shown beside the corresponding factor matrix, in which rows are replaced by the average of all samples of a given factor level.]
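A minimal base R sketch of the row-replacement step and the resulting additive partition follows. It is not the ChemoSpec code: `level_means()` is a hypothetical helper, the data are random stand-ins, and the order of operations (remove the grand mean, then each factor in turn from what remains) is an assumption chosen for illustration.

```r
# Hypothetical helper: replace every row of M by the column-wise mean of all
# rows that share its factor level
level_means <- function(M, f) {
  out <- M
  for (lv in levels(f)) {
    idx <- which(f == lv)
    avg <- colMeans(M[idx, , drop = FALSE])
    out[idx, ] <- matrix(avg, nrow = length(idx), ncol = ncol(M), byrow = TRUE)
  }
  out
}

# Random stand-in data: 20 samples x 8 variables, two factors with two levels
set.seed(3)
f1 <- factor(rep(c("A", "B"), each = 10))
f2 <- factor(rep(rep(c("X", "Y"), each = 5), times = 2))
X  <- matrix(rnorm(20 * 8), nrow = 20)

# Grand mean matrix: every row is the vector of column means
grand <- matrix(colMeans(X), nrow = nrow(X), ncol = ncol(X), byrow = TRUE)

# Remove each term in turn from what is left over
R0  <- X - grand
M1  <- level_means(R0, f1)                               # factor 1 matrix
R1  <- R0 - M1
M2  <- level_means(R1, f2)                               # factor 2 matrix
R2  <- R1 - M2
M12 <- level_means(R2, interaction(f1, f2, drop = TRUE)) # interaction matrix
E   <- R2 - M12                                          # residual error

# The submatrices add back up to the raw data
recon_error <- max(abs(grand + M1 + M2 + M12 + E - X))
```

Because each matrix is obtained by subtraction from the previous remainder, the terms sum to the raw data by construction, and `recon_error` should be zero to within rounding error.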

Computational Strategy

We implemented the ANOVA-PCA concept using a series of R functions as shown below. The functions were integrated into the package ChemoSpec, which uses a Spectra object to store the spectral data and associated information.

aovPCA(spectra, facs, ...)
(spectra: Spectra object; facs: factors to be analyzed)

avgFacLvls(spectra, facs)

aovPCAscores(spectra, LM, plot, type, ...)

aovPCAloadings(spectra, LM, pca, plot, ...)

Simulated Data

For testing purposes, a simulated data set was created by generating data which simulate UV-Vis spectra or chromatograms. Four different sample types were created in such a way as to represent group membership and the effects of the factors. The levels of the first factor are A and B; the second factor has levels X and Y. The entire data set contains 5 of each sample type, and noise was added to the spectra in order to make them more realistic. The prototype samples are shown below; type BY represents an interaction between the factors, as a peak is missing entirely compared to the other sample types.

[Figure: prototype spectra in four panels: Sample Type AX, Sample Type AY, Sample Type BX, Sample Type BY; x-axis: Wavelength or Time (0 to 15); y-axis: Peak intensity.]
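A data set of this general kind can be sketched in base R as follows. The peak positions, heights, widths and noise level are all assumptions invented for illustration; the values actually used for the poster are not given. Only the overall design (two factors with two levels each, 5 replicates per sample type, added noise, and a peak absent from type BY) follows the description above.

```r
set.seed(4)
x <- seq(0, 15, length.out = 150)   # wavelength or time axis

# Hypothetical helper: a single Gaussian peak evaluated on x
peak <- function(x, center, height, width = 0.5) {
  height * exp(-(x - center)^2 / (2 * width^2))
}

# Prototype spectra (assumed shapes): a small common peak at 3, a peak at 8
# for level B, a peak at 12 for level Y; the common peak is absent in BY
prototypes <- rbind(
  AX = peak(x, 3, 0.4),
  AY = peak(x, 3, 0.4) + peak(x, 12, 2),
  BX = peak(x, 3, 0.4) + peak(x, 8, 2),
  BY = peak(x, 8, 2) + peak(x, 12, 2)
)

# Five noisy replicates of each sample type
types   <- rep(rownames(prototypes), each = 5)
noise   <- matrix(rnorm(length(types) * length(x), sd = 0.05),
                  nrow = length(types))
spectra <- prototypes[types, ] + noise

# Factor labels recovered from the sample type names
f1 <- factor(substr(types, 1, 1))   # levels A, B
f2 <- factor(substr(types, 2, 2))   # levels X, Y
```

Each row of `spectra` is one simulated sample, so the matrix can be passed directly to `prcomp()` or to a row-averaging routine.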

Standard PCA

The results of a traditional PCA on these simulated data are shown below. As expected, this data set does not present a challenge, and each group is clustered away from the others.

[Figure: Original Data: PCA Score Plot, robust ellipses by group; x-axis: PC1 score (30%); y-axis: PC2 score (8.6%); centered/noscale/classical; key: AX, AY, BX, BY.]

aovPCA Score Plots

The following figures show the results of aovPCA on the simulated data. Factor 1 has levels A and B. Because the 1st factor is significant, there is separation between levels along PC1.

[Figure: FactorAB: PCA Score Plot, robust ellipses by group; x-axis: PC1 score (32%); y-axis: PC2 score (5.5%); centered/noscale/classical; key: AX, AY, BX, BY.]
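The idea behind such a score plot, PCA of one factor matrix after the residual error has been added back, can be sketched in base R. This is an illustration under stated assumptions, not the aovPCA code: the data are random noise with an effect planted in one variable per factor, and `lvl_means()` is a hypothetical one-line helper built on `ave()`.

```r
set.seed(5)
f1 <- factor(rep(c("A", "B"), each = 10))
f2 <- factor(rep(rep(c("X", "Y"), each = 5), times = 2))

# Stand-in data: small noise plus an effect of each factor in one variable
X <- matrix(rnorm(20 * 30, sd = 0.1), nrow = 20)
X[f1 == "B", 10] <- X[f1 == "B", 10] + 1   # factor 1 acts on variable 10
X[f2 == "Y", 20] <- X[f2 == "Y", 20] + 1   # factor 2 acts on variable 20

# Hypothetical helper: replace each entry by the mean of its factor level,
# column by column
lvl_means <- function(M, f) apply(M, 2, function(col) ave(col, f))

# Partition: remove the grand mean, then each term from what remains
R0  <- sweep(X, 2, colMeans(X))
M1  <- lvl_means(R0, f1);  R1 <- R0 - M1
M2  <- lvl_means(R1, f2);  R2 <- R1 - M2
M12 <- lvl_means(R2, interaction(f1, f2))
E   <- R2 - M12

# PCA of the factor 1 matrix with the residual error added back
pca1 <- prcomp(M1 + E, center = TRUE, scale. = FALSE)
pc1  <- pca1$x[, 1]

# Mean PC1 score per level, and the variable with the largest PC1 loading
pc1_by_level <- tapply(pc1, f1, mean)
top_variable <- unname(which.max(abs(pca1$rotation[, 1])))
```

With an effect this large relative to the noise, the PC1 scores for levels A and B should fall on opposite sides of zero, and `top_variable` should point at the variable where the factor 1 effect was planted. The sign of a principal component is arbitrary, so which level ends up positive carries no meaning.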

Factor 2 has levels X and Y. As before, separation is observed along PC1, but it is not as good as for Factor 1. Note the groups here are reversed compared to the 1st factor.

[Figure: FactorXY: PCA Score Plot, robust ellipses by group; x-axis: PC1 score (9.1%); y-axis: PC2 score (7.4%); centered/noscale/classical; key: AX, AY, BX, BY.]

The following figure shows the interaction between Factors 1 and 2. In this case, AX and BY are grouped separately from AY and BX. This separation shows that the interaction between the two factors is significant, as expected.

[Figure: FactorAB x FactorXY: PCA Score Plot, robust ellipses by group; x-axis: PC1 score (15%); y-axis: PC2 score (6.9%); centered/noscale/classical; key: AX, AY, BX, BY.]

Finally, aovPCA on the residual error is shown. The residual error is the unexplained variance, so there should be neither separation along PC1 nor any clustering. This is exactly what is observed.

[Figure: Res.Error: PCA Score Plot, robust ellipses by group; x-axis: PC1 score (8.1%); y-axis: PC2 score (7.5%); centered/noscale/classical; key: AX, AY, BX, BY.]

aovPCA Loadings Plots

The loadings for Factor AB are shown below. Peaks in the PC1 loadings show which wavelengths contribute the most to the separation in the data. The reference spectrum clearly shows the noise that was added to the data.

[Figure: FactorAB: Loadings Plot; centered/noscale/classical; two panels sharing an x-axis from 0 to 15: Reference Spectrum and PC1 Loadings.]

aovPCA of Real Data

The data in this example consist of infrared (IR) spectra of the leaf surface of the common "weed" purslane (Portulaca oleracea). Varieties collected in South Carolina and Wisconsin were grown in low and high water conditions as part of a larger study on climate change. The aovPCA score plot with Treatment as the factor is shown below; aovPCA is not able to separate the groups based upon this factor.

[Figure: Treatment: PCA Score Plot, robust ellipses by group; x-axis: PC1 score (77%); y-axis: PC2 score (13%); centered/noscale/classical; key: SC.H, SC.L, WI.H, WI.L.]

However, if the first 3 PCs from standard PCA are plotted, we can see separation between the treatments. With this difficult data set, standard PCA outperforms aovPCA.

[Figure: Cuticle IR Spectra: PCA Score Plot; axes: PC1 (83%), PC2 (14%), PC3 (1.1%); key: SC.H, SC.L, WI.H, WI.L; EB water trial week 3 IR Spectra Summer 2010.]

References

Pinto, Bosc, Nocairi, Barros, and Rutledge. "Using ANOVA-PCA for Discriminant Analysis. . ." Analytica Chimica Acta 629.1-2 (2008): 47-55.

Harrington, Vieira, Espinoza, Nien, Romero, and Yergey. "Analysis of Variance–Principal Component Analysis. . ." Analytica Chimica Acta 544.1-2 (2005): 118-27.

Software: R packages ChemoSpec and HandyStuff

github.com/bryanhanson

Acknowledgements

Support was provided by the Science Research Fellows program and Chemistry Department research funds.

Summer 2011
