PCA by Projection Pursuit
The Package pcaPP
Heinrich Fritz
Vienna University of Technology, Austria
Vienna, Austria
June, 2006
Joint work with . . .
P. Filzmoser, Department of Statistics and Probability Theory
Vienna University of Technology, Austria
C. Croux, Department of Applied Economics
K.U. Leuven, Belgium
M.R. Oliveira, Department of Mathematics
Instituto Superior Tecnico, Lisbon, Portugal
K. Kalcher, Vienna University of Technology, Austria
Agenda
• Principal components
• Robust approaches
• The implementation
• Supporting methods
• Covariance estimation by PCAs
Principal Component Analysis (PCA)
[Figure: scatter plot of the data (x vs. y), shown first without and then with the principal axes PC1 and PC2]
Outliers
[Figure: scatter plot of the data (x vs. y) with outliers; two pairs of axes labeled PC1 and PC2 are drawn in the final frame]
The Classical Approach
• PCA by decomposition of the covariance matrix
$\Sigma = \Gamma \Lambda \Gamma^t$
$Y = (X - 1\bar{x}^t)\,\Gamma$
• Robustness due to robust covariance estimates.
– package rrcov: covMCD, covMest
– package robustbase: covGK, covOGK
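The classical decomposition above can be sketched in base R as follows. This is only an illustration on the built-in swiss data; the variable names (Sigma, Gamma, Lambda, Y) mirror the symbols on the slide and are not part of any package.

```r
# Classical PCA via eigen-decomposition of the sample covariance matrix.
data(swiss)
X <- as.matrix(swiss)

Sigma  <- cov(X)                          # sample covariance matrix
eig    <- eigen(Sigma, symmetric = TRUE)  # Sigma = Gamma Lambda Gamma^t
Gamma  <- eig$vectors                     # loadings (eigenvectors in columns)
Lambda <- diag(eig$values)                # variances of the components

# Scores: center the columns, then rotate onto the eigenvectors.
Y <- sweep(X, 2, colMeans(X)) %*% Gamma

# Reconstruction error of Sigma (should be numerically zero)
recon_err <- max(abs(Gamma %*% Lambda %*% t(Gamma) - Sigma))
```

Replacing cov(X) with a robust covariance estimate would give the robust variant mentioned on the slide.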
PCA by Projection Pursuit
• No covariance estimation necessary
• Especially for high dimensional data
• Procedure
– Define a data center (mean, median, l1median, . . . )
– Search for promising directions by maximizing a spread estimator (sd, mad, qn) of the data projected onto these directions
– Reduce the number of candidate directions
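A minimal base R sketch of the first two steps of this procedure, for the first direction only. It assumes the coordinate-wise median as center and uses the directions through each observation as candidates; the function name pp_first_direction and the toy data are hypothetical, and this is not the package's C implementation.

```r
# Sketch: first principal direction by projection pursuit.
pp_first_direction <- function(X, spread = mad) {
  X  <- as.matrix(X)
  Xc <- sweep(X, 2, apply(X, 2, median))    # step 1: center the data robustly
  norms <- sqrt(rowSums(Xc^2))
  keep  <- norms > 0
  # Candidate directions: unit vectors from the center to each observation
  A <- Xc[keep, , drop = FALSE] / norms[keep]
  # Step 2: spread of the projected data for every candidate direction
  s <- apply(Xc %*% t(A), 2, spread)
  list(direction = A[which.max(s), ], spread = max(s))
}

set.seed(1)
X   <- cbind(x = rnorm(200, sd = 5), y = rnorm(200))  # toy data (assumed)
res <- pp_first_direction(X)               # spread = mad by default
res_sd <- pp_first_direction(X, spread = sd)
```

Further components would be searched in the same way in the orthogonal complement of the directions already found.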
Defining the Data Center
[Figure: scatter plot of the data (x vs. y) with the data center marked]
Maximizing Spread
[Figure sequence, 21 frames: scatter plot of the data (x vs. y), one candidate direction per frame, with the spread of the projected data reported as s and as MAD]
Frame   s      MAD
1       0.62   0.54
2       0.62   0.46
3       0.63   0.40
4       0.63   0.32
5       0.64   0.26
6       0.65   0.25
7       0.65   0.30
8       0.66   0.35
9       0.66   0.43
10      0.67   0.52
11      0.67   0.63
12      0.66   0.67
13      0.66   0.70
14      0.65   0.69
15      0.65   0.67
16      0.64   0.61
17      0.63   0.64
18      0.63   0.63
19      0.62   0.66
20      0.62   0.59
21      0.62   0.54
PCAproj
[Figure sequence: scatter plot of the data (x vs. y) with candidate directions]
Candidate Directions:
• each data point
• additionally random directions through center
• additional directions by linear combinations of data points
• update algorithm (based on eigenvalues)
PCAgrid
Grid Algorithm:
Optimization is done on a regular grid in the plane.
• select two variables
• optimization on the grid
• select other variables
• . . .
[Figure: scatter plot of the data (x vs. y) with the grid of directions]
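The optimization on a grid in the plane can be sketched in base R for the two-variable case. This is an assumption-laden illustration, not the package's algorithm: the function name grid_direction_2d is hypothetical, the arguments splitcircle and nhalving only borrow the vocabulary of the slides, and the center is taken to be the coordinate-wise median.

```r
# Sketch: search the angle that maximizes the spread of the projected data,
# refining a regular grid of angles by halving its width in each pass.
grid_direction_2d <- function(X, spread = mad, splitcircle = 10, nhalving = 5) {
  X  <- as.matrix(X)
  stopifnot(ncol(X) == 2)
  Xc <- sweep(X, 2, apply(X, 2, median))      # center the data
  obj <- function(a) spread(drop(Xc %*% c(cos(a), sin(a))))
  best  <- 0
  width <- pi / 2                             # first pass covers a half circle
  for (i in seq_len(nhalving)) {
    angles <- best + seq(-width, width, length.out = splitcircle + 1)
    vals   <- vapply(angles, obj, numeric(1))
    best   <- angles[which.max(vals)]         # keep the best grid point
    width  <- width / 2                       # halve the search interval
  }
  c(cos(best), sin(best))
}

set.seed(2)
X <- cbind(x = rnorm(200, sd = 5), y = rnorm(200))  # toy data (assumed)
d <- grid_direction_2d(X)
```

With more than two variables, the slide's scheme repeats such a planar search over pairs of variables.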
Implementation
• Implementation in C
• Wrapping functions
– PCAproj(x, k = 2, method = c("sd", "mad", "qn"),
  CalcMethod = c("eachobs", "lincomb", "sphere"), nmax = 1000,
  update = TRUE, scores = TRUE, maxit = 5, maxhalf = 5, control, ...)
– PCAgrid(x, k = 2, method = c("sd", "mad", "qn"), maxiter = 10,
  splitcircle = 10, scores = TRUE, anglehalving = TRUE,
  fact2dim = 10, control, ...)
Common Parameters
• x: Data matrix (data frame)
• k: Number of principal components
• method: Spread estimator for projection pursuit
• scores: Return scores-matrix?
• control: Control-structure
• ...: Passed to ScaleAdv
PCAproj - Individual Parameters
• CalcMethod: "eachobs","lincomb" or "sphere"
• nmax: Maximum number of directions to search in each step (for "lincomb" or "sphere")
• update: Perform update steps?
– maxhalf: Maximum number of steps for angle halving
– maxit: Maximum number of iterations
PCAgrid - Individual Parameters
• splitcircle: Number of directions
• anglehalving: Perform angle halving?
• fact2dim: Behavior in the 2-dimensional case.
• maxiter: Maximum number of iterations.
Return Structure
• (S3) class pcaPP derived from princomp:
– sdev: Spread of principal components
– loadings: Matrix containing the loadings
– center: Center applied to the data matrix
– scale: Scale applied to the data matrix
– n.obs: Number of observations
– scores: Matrix containing the scores
– call: Function call
Additional Functions
• l1median(X, MaxStep = 200, ItTol = 10^-8)
Robust center estimator
• qn(x)
Robust scale estimator
• ScaleAdv(x, center = mean, scale = sd)
Advanced scaling method (takes functions or vectors as input values)
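To illustrate what a robust center estimator like the L1-median computes, here is a base R sketch using a Weiszfeld-type iteration. This is a swapped-in textbook technique, not the package's l1median code; the function name l1median_sketch is hypothetical, and only the argument names MaxStep and ItTol are taken from the slide.

```r
# Sketch: L1-median (the point minimizing the sum of Euclidean distances
# to all observations) via iteratively reweighted means.
l1median_sketch <- function(X, MaxStep = 200, ItTol = 1e-8) {
  X <- as.matrix(X)
  m <- apply(X, 2, median)                   # start: coordinate-wise median
  for (step in seq_len(MaxStep)) {
    d <- sqrt(rowSums(sweep(X, 2, m)^2))     # distances to current estimate
    w <- 1 / pmax(d, .Machine$double.eps)    # guard against division by zero
    m_new <- colSums(X * w) / sum(w)         # distance-weighted mean
    converged <- sqrt(sum((m_new - m)^2)) < ItTol
    m <- m_new
    if (converged) break
  }
  m
}

# Four corners of a square: the L1-median is the midpoint.
corners <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2))
center_square <- l1median_sketch(corners)

# One far outlier moves the mean a lot and the L1-median much less.
with_outlier <- rbind(corners, c(100, 100))
center_out   <- l1median_sketch(with_outlier)
mean_out     <- colMeans(with_outlier)
```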
Robust Covariance Estimation
• Robust covariance estimation based on PCs
$\Sigma = \Gamma \Lambda \Gamma^t$
• covPCAproj(x, control)
• covPCAgrid(x, control)
• covPC(x, k, method) (under construction . . . )
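The idea of rebuilding a covariance matrix from principal components can be sketched in base R on any princomp-like object with loadings and sdev elements. The helper cov_from_pcs is hypothetical and is not the package's covPC; classical princomp is used here only so that the result can be checked against a known covariance matrix.

```r
# Sketch: covariance matrix from loadings (Gamma) and spreads (sqrt of Lambda).
cov_from_pcs <- function(pc) {
  G <- unclass(pc$loadings)                  # loadings matrix Gamma
  lambda <- unname(pc$sdev)^2                # squared spreads = diagonal of Lambda
  G %*% diag(lambda, nrow = length(lambda)) %*% t(G)
}

data(swiss)
pc <- princomp(swiss)                        # classical PCA, all 6 components
S  <- cov_from_pcs(pc)

# princomp uses the divisor n, so S matches cov(swiss) * (n - 1) / n.
n <- nrow(swiss)
S_expected <- cov(swiss) * (n - 1) / n
```

Passing loadings and spreads obtained from a robust projection-pursuit PCA instead would, under the same formula, give a robust covariance estimate.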
Example
> library(pcaPP)
> data(swiss)
> result = PCAproj(swiss, k = 6, method = "mad")
> summary(result)
Importance of components:
                           Comp.1     Comp.2      Comp.3     Comp.4      Comp.5
Standard deviation     44.1005199 41.0302723 17.09114152 6.92022550 4.619893062
Proportion of Variance  0.4859749  0.4206639  0.07299087 0.01196649 0.005333229
Cumulative Proportion   0.4859749  0.9066387  0.97962962 0.99159611 0.996929342
                            Comp.6
Standard deviation     3.505520822
Proportion of Variance 0.003070658
Cumulative Proportion  1.000000000
Example
screeplot(result)
[Figure: scree plot of the variances of Comp.1 to Comp.6]
Example
biplot(result)
[Figure: biplot of Comp.1 against Comp.2 showing the Swiss provinces and the variables Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality]
Covariance Estimation
> library (covrob)
> covswiss.mad <- covrob (swiss, method="covPCAproj", control = list (k=6,method="mad"))
> covswiss.sd <- covrob (swiss, method="covPCAproj", control = list (k=6,method="sd"))
> plot (covswiss.mad, covswiss.sd)
Covariance Estimation
[Figure: pairwise comparison of the two covariance estimates for the variables Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality. Legend: Robust cov-estimation based on PCs (projection mode - sd); Robust cov-estimation based on PCs (projection mode - mad)]