
PCA by Projection Pursuit

The Package pcaPP

Heinrich Fritz

Vienna University of Technology, Austria

Vienna, Austria

June, 2006

Joint work with . . .

P. Filzmoser, Department of Statistics and Probability Theory, Vienna University of Technology, Austria

C. Croux, Department of Applied Economics, K.U. Leuven, Belgium

M.R. Oliveira, Department of Mathematics, Instituto Superior Técnico, Lisbon, Portugal

K. Kalcher, Vienna University of Technology, Austria

Agenda

• Principal components

• Robust approaches

• The implementation

• Supporting methods

• Covariance estimation by PCAs

Principal Component Analysis (PCA)

[Figure: scatter plot of the data, x against y (both axes 0 to 4), shown first on its own and then with the principal component directions PC1 and PC2 overlaid]

Outliers

[Figure: the same scatter plot with outliers, shown first on its own, then with PC1 and PC2 overlaid, and finally with two sets of PC1/PC2 axes overlaid]

The Classical Approach

• PCA by decomposition of the covariance matrix

$\Sigma = \Gamma \Lambda \Gamma^t$

$Y = (X - 1\bar{x}^t)\,\Gamma$

• Robustness is obtained by using robust covariance estimates.

– package rrcov: covMCD, covMest

– package robustbase: covGK, covOGK
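
The classical approach above can be sketched in base R as follows. This is only an illustration under the assumption that the sample covariance `cov()` is the estimate being decomposed; the variable names (`Gamma`, `Lambda`, `xbar`, `recon_err`) are chosen here to mirror the symbols on the slide, and a robust variant would replace `cov(X)` by a robust covariance estimate.

```r
# Classical PCA via eigendecomposition of the sample covariance matrix.
# Base R only; `swiss` ships with R's datasets package.
X <- as.matrix(swiss)

xbar  <- colMeans(X)                    # data center (column means)
Sigma <- cov(X)                         # classical covariance estimate
eig   <- eigen(Sigma, symmetric = TRUE)

Gamma  <- eig$vectors                   # loadings: columns are eigenvectors
Lambda <- diag(eig$values)              # variances of the components

# Scores: center the data, then rotate onto the eigenvectors.
Y <- sweep(X, 2, xbar) %*% Gamma

# Largest absolute deviation between Gamma Lambda Gamma^t and Sigma.
recon_err <- max(abs(Gamma %*% Lambda %*% t(Gamma) - Sigma))
```

The variances of the columns of `Y` equal the eigenvalues, which is what the decomposition $\Sigma = \Gamma \Lambda \Gamma^t$ expresses.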

PCA by Projection Pursuit

• No covariance estimation necessary

• Especially for high dimensional data

• Procedure

– Define a data center (mean, median, l1median, . . . )

– Search for promising directions by maximizing a spread estimator (sd, mad, qn) of the data projected onto these directions

– Reduce the number of candidate directions
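
The procedure above can be sketched for the first component in base R. This is a simplified illustration, not the pcaPP code: `pp_first_direction` is a hypothetical helper, the coordinatewise median is assumed as the center, the candidate directions are assumed to be the centered observations themselves, and the step that reduces the number of candidates is left out.

```r
# Sketch of the first projection-pursuit component (illustrative only).
# pp_first_direction is a hypothetical helper, not part of pcaPP.
pp_first_direction <- function(X,
                               center = function(x) apply(x, 2, median),
                               spread = mad) {
  X  <- as.matrix(X)
  Xc <- sweep(X, 2, center(X))                 # 1. define a data center
  norms <- sqrt(rowSums(Xc^2))
  keep  <- norms > 0                           # drop points lying on the center
  A <- Xc[keep, , drop = FALSE] / norms[keep]  # candidate unit directions (rows)
  # 2. spread of the data projected onto each candidate direction
  spreads <- apply(A, 1, function(a) spread(drop(Xc %*% a)))
  best <- which.max(spreads)
  list(direction = A[best, ], spread = spreads[best])
}

set.seed(1)
X <- cbind(rnorm(100, sd = 3), rnorm(100, sd = 1))
X[1:5, ] <- cbind(rep(0, 5), rep(30, 5))       # a few gross outliers
fit <- pp_first_direction(X)
```

Passing `spread = sd` instead of `mad` gives the non-robust version of the same search, which makes it easy to compare how the outliers affect the chosen direction.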

Defining the Data Center

[Figure: scatter plot of the data, x against y (both axes 0 to 4), illustrating the data center]

Maximizing Spread

[Figure sequence: 21 frames of the x-y scatter plot (both axes −1 to 5). Each frame reports the standard deviation s and the MAD of the projected data.]

Frame   s      MAD
1       0.62   0.54
2       0.62   0.46
3       0.63   0.40
4       0.63   0.32
5       0.64   0.26
6       0.65   0.25
7       0.65   0.30
8       0.66   0.35
9       0.66   0.43
10      0.67   0.52
11      0.67   0.63
12      0.66   0.67
13      0.66   0.70
14      0.65   0.69
15      0.65   0.67
16      0.64   0.61
17      0.63   0.64
18      0.63   0.63
19      0.62   0.66
20      0.62   0.59
21      0.62   0.54

PCAproj

[Figure sequence: four frames of the x-y scatter plot (both axes 0 to 4) illustrating PCAproj]

PCAproj

Candidate Directions:

• each data point

• additionally random directions through center

• additional directions by linear combinations of data points

• update algorithm (based on eigenvalues)

[Figure: x-y scatter plot (both axes 0 to 4)]

PCAgrid

Grid Algorithm:

Optimization is done on a regular grid in the plane.

• select two variables

• optimization on the grid

• select other variables

• . . .
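
For a single pair of variables, a grid search over directions in the plane might look like the sketch below. It is an assumption-laden illustration rather than the PCAgrid implementation: `grid_direction_2d` is a hypothetical helper, the argument names `splitcircle` and `maxiter` are borrowed from the PCAgrid signature but their exact meaning here is assumed, and the refinement by halving the angle range is one plausible reading of "angle halving".

```r
# Sketch of a grid search over directions in the plane (illustrative only).
# grid_direction_2d is a hypothetical helper, not the pcaPP implementation.
grid_direction_2d <- function(X, spread = mad, splitcircle = 10, maxiter = 10) {
  X  <- as.matrix(X)
  Xc <- sweep(X, 2, apply(X, 2, median))     # center at the coordinatewise median
  objective <- function(theta) {
    spread(drop(Xc %*% c(cos(theta), sin(theta))))
  }
  best  <- 0
  width <- pi                                # directions are defined up to sign
  for (i in seq_len(maxiter)) {
    # regular grid of angles around the current best direction
    grid <- best + seq(-width / 2, width / 2, length.out = splitcircle + 1)
    vals <- vapply(grid, objective, numeric(1))
    best <- grid[which.max(vals)]
    width <- width / 2                       # halve the angle range and refine
  }
  c(cos(best), sin(best))                    # unit vector of the best direction
}

set.seed(1)
X <- cbind(rnorm(200, sd = 3), rnorm(200))
dir_hat <- grid_direction_2d(X)
```

In more than two dimensions the slide describes cycling through pairs of variables; the sketch covers only the planar step.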

[Figure: x-y scatter plot (both axes 0 to 4) illustrating the grid algorithm]

Implementation

• Implementation in C

• Wrapping functions

– PCAproj(x, k = 2, method = c("sd", "mad", "qn"), CalcMethod = c("eachobs", "lincomb", "sphere"), nmax = 1000, update = TRUE, scores = TRUE, maxit = 5, maxhalf = 5, control, ...)

– PCAgrid(x, k = 2, method = c("sd", "mad", "qn"), maxiter = 10, splitcircle = 10, scores = TRUE, anglehalving = TRUE, fact2dim = 10, control, ...)


Common Parameters

• x: Data matrix (data frame)

• k: Number of principal components

• method: Spread estimator for projection pursuit

• scores: Return scores-matrix?

• control: Control-structure

• ...: Further arguments, passed to ScaleAdv

PCAproj - Individual Parameters

• CalcMethod: "eachobs", "lincomb" or "sphere"

• nmax: Maximum number of directions to search in each step (for "lincomb" or "sphere")

• update: Perform update steps?

– maxhalf: Maximum number of steps for angle halving

– maxit: Maximum number of iterations

PCAgrid - Individual Parameters

• splitcircle: Number of directions

• anglehalving: Perform angle halving?

• fact2dim: Behavior in the two-dimensional case.

• maxiter: Maximum number of iterations.

Return Structure

• (S3) class pcaPP derived from princomp:

– sdev: Spread of principal components

– loadings: Matrix containing the loadings

– center: Center applied to the data matrix

– scale: Scale applied to the data matrix

– n.obs: Number of observations

– scores: Matrix containing the scores

– call: Function call


Additional Functions

• l1median(X, MaxStep = 200, ItTol = 10^-8)

Robust center estimator

• qn(x)

Robust scale estimator

• ScaleAdv(x, center = mean, scale = sd)

Advanced scaling method (takes functions or vectors as input values)
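
To make the idea of a robust center concrete, here is a minimal sketch of an L1-median computed with Weiszfeld's iteration. The slides do not say which algorithm `l1median` uses, so this is an assumption chosen for illustration; `l1median_sketch` is a hypothetical helper, and only its argument names `MaxStep` and `ItTol` are taken from the signature above. The plain iteration can stall when an iterate coincides with a data point, which the sketch does not try to handle.

```r
# Weiszfeld-type iteration for the L1-median (illustrative only).
# l1median_sketch is a hypothetical helper, not the pcaPP function.
l1median_sketch <- function(X, MaxStep = 200, ItTol = 1e-8) {
  X <- as.matrix(X)
  m <- apply(X, 2, median)                   # start at the coordinatewise median
  for (step in seq_len(MaxStep)) {
    d <- sqrt(rowSums(sweep(X, 2, m)^2))     # distances to the current center
    d <- pmax(d, .Machine$double.eps)        # guard against division by zero
    w <- 1 / d
    m_new <- colSums(X * w) / sum(w)         # distance-weighted mean of the rows
    if (sqrt(sum((m_new - m)^2)) < ItTol) return(m_new)
    m <- m_new
  }
  m
}

# Sum of Euclidean distances, the quantity the L1-median minimizes.
sum_dist <- function(X, m) sum(sqrt(rowSums(sweep(X, 2, m)^2)))

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
m_hat <- l1median_sketch(X)
```

Comparing `sum_dist(X, m_hat)` with `sum_dist(X, colMeans(X))` shows that the iteration reaches a smaller sum of distances than the mean does.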

Robust Covariance Estimation

• Robust covariance estimation based on PCs

$\Sigma = \Gamma \Lambda \Gamma^t$

• covPCAproj(x, control)

• covPCAgrid(x, control)

• covPC(x, k, method) (under construction . . . )
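
The reconstruction $\Sigma = \Gamma \Lambda \Gamma^t$ can be sketched in a few lines of base R. `cov_from_pcs` is a hypothetical helper written for this illustration and is not one of the package functions listed above; it assumes only that a fit provides a loadings matrix and the spreads of the components, as in the return structure described earlier.

```r
# Rebuild a covariance matrix from loadings and component spreads.
# cov_from_pcs is a hypothetical helper, not a pcaPP function.
cov_from_pcs <- function(loadings, sdev) {
  Gamma  <- unclass(loadings)                 # plain matrix, "loadings" class dropped
  Lambda <- diag(sdev^2, nrow = length(sdev)) # squared spreads on the diagonal
  Sigma  <- Gamma %*% Lambda %*% t(Gamma)
  dimnames(Sigma) <- list(rownames(Gamma), rownames(Gamma))
  Sigma
}

# Sanity check with classical PCA: princomp uses divisor n, so the result
# matches cov(swiss) rescaled by (n - 1) / n.
pc <- princomp(swiss)
Sigma_hat <- cov_from_pcs(pc$loadings, pc$sdev)
n <- nrow(swiss)
```

Feeding in the `loadings` and `sdev` of a robust projection-pursuit fit instead of the `princomp` result would, under the same assumption, give a robust covariance estimate.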

Example

> library(pcaPP)
> data(swiss)
> result = PCAproj(swiss, k = 6, method = "mad")
> summary(result)
Importance of components:
                           Comp.1     Comp.2      Comp.3     Comp.4      Comp.5
Standard deviation     44.1005199 41.0302723 17.09114152 6.92022550 4.619893062
Proportion of Variance  0.4859749  0.4206639  0.07299087 0.01196649 0.005333229
Cumulative Proportion   0.4859749  0.9066387  0.97962962 0.99159611 0.996929342
                            Comp.6
Standard deviation     3.505520822
Proportion of Variance 0.003070658
Cumulative Proportion  1.000000000

Example

> screeplot(result)

[Figure: scree plot titled "Scree-plot" showing the variances of Comp.1 to Comp.6; vertical axis "Variances" from 0 to 1500]

Example

> biplot(result)

[Figure: biplot titled "Biplot" of Comp.1 against Comp.2 for the swiss data. Observations are labelled with the province names (Courtelary, Delemont, ..., Rive Gauche) and the variables Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality are drawn as arrows.]

Covariance Estimation

> library (covrob)

> covswiss.mad <- covrob (swiss, method="covPCAproj", control = list (k=6,method="mad"))

> covswiss.sd <- covrob (swiss, method="covPCAproj", control = list (k=6,method="sd"))

> plot (covswiss.mad, covswiss.sd)

Covariance Estimation

[Figure: pairwise comparison plot of the two covariance estimates for the variables Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality. Each off-diagonal panel prints the corresponding covariance from both estimates. Legend: "Robust cov-estimation based on PCs (projection mode - sd)" and "Robust cov-estimation based on PCs (projection mode - mad)".]