methodological concepts for source apportionment · analyzing compositional data with r. springer,...

Methodological Conceptsfor Source Apportionment

Peter Filzmoser

Institute of Statistics and Mathematical Methods in EconomicsVienna University of Technology

UBA Berlin, Germany

November 18, 2016

in collaboration with

Example data: POPs

POPs: Persistent Organic Pollutants7 dioxine compounds (D1-D7), 10 fouran compounds (F1-F10)Two “similar” observations, with very different scale (concentration levels)!

D1

D2

D3

D4

D5

D6

D7

F1

F2

F3

F4

F5

F6

F7

F8

F9

F10

020

0040

0060

0080

0010

000

Domestic heating, obs. 4

D1

D2

D3

D4

D5

D6

D7

F1

F2

F3

F4

F5

F6

F7

F8

F9

F10

020

040

060

080

010

00

Domestic heating, obs. 8

Should we normalize to row sum 1, or analyze (log-)ratios?

Compositional data

Definition: Compositional data consist of vectors x = (x1, . . . , xD) withD strictly positive components describing the parts on a whole, and whichcarry only relative information (Aitchison, 1986; Egozcue, 2009).

Consequences:

• The values x1, . . . , xD as such are not informative, but only their ratiosare of interest.

• The parts x1, . . . , xD do not need to sum up to 1.

• Compositional data follow the so-called Aitchison geometry on thesimplex (and not the standard Euclidean geometry).

Key reference:

J. Aitchison. The Statistical Analysis of Compositional Data. Chapman andHall, London, U.K., 1986.

Example data: POPs

Two compositions with D parts:

x = (x1, x2, . . . , xD)

x̃ = (x̃1, x̃2, . . . , x̃D)

Aitchison distance between x and x̃ is defined as:

dA(x, x̃) =1

D

D−1∑i=1

D∑j=i+1

(log

xixj− log

x̃ix̃j

)2

For comparison: Euclidean distance:

dE(x, x̃) =

√√√√√ D∑i=1

(xi − x̃i)2

Example data: POPs

D1

D2

D3

D4

D5

D6

D7

F1

F2

F3

F4

F5

F6

F7

F8

F9

F10

020

0040

0060

0080

0010

000

Original data

D1

D2

D3

D4

D5

D6

D7

F1

F2

F3

F4

F5

F6

F7

F8

F9

F10

0.0

0.1

0.2

0.3

0.4

Normalized to row sum 1

Aitchison distance: 1.21 Aitchison distance: 1.21Euclidean distance: 10636 Euclidean distance: 0.075

Euclidean geometry

Represent information from the simplex in the Euclidean geometry:

• clr (centered log-ratio) coefficients:Divide values by the geometric mean:

y = (y1, . . . , yD) =

log x1D√∏D

i=1 xi, . . . , log

xDD√∏D

i=1 xi

• ilr (isometric log-ratio) coordinates:z = (z1, . . . , zD−1) can be defined by

zi =

√D − i

D − i+1log

xiD−i√∏D

j=i+1 xj, i = 1, . . . , D − 1.

clr & ilr

Both, clr and ilr are isometric, which means that

dA(x, x̃) = dE(clr(x), clr(x̃)) = dE(ilr(x), ilr(x̃))

where dE stands for Euclidean distance.

This means that after expressing the information in clr coefficients or ilrcoordinates, the data are in the usual Euclidean geometry. Most standardstatistical methods are designed for this geometry.

A log-transformation does not transfer the compositions to this Euclideangeometry!

Example data

Data set with compositional parts (dioxins and indicator PCBs)PCB77, PCB126, PCB169, PCB105, PCB114, PCB118, PCB123, PCB156,PCB157, PCB167, PCB189, PCB28, PCB52, PCB101, PCB138, PCB153,PCB180measured in 56 “emissions” (ng/m3) and 65 “products” (µg/g or ng/g).

●●

●●●

●

●●●●●

●●●

●●

●

●

●

●●●

●●●●

●

●●

●●

●

●

●

●

●●●

●●●●●

●●●●

●●●

●●●●

●

●

●

●●

●●

●

●

●●

●●●●●●●●●●●●

●

●

●●●●

●●●●

●●

●●

●

●

●

●

●

●●●

●

●

●

●●●●

●●●

●

●●●●●

●●●

●●●

0 20 40 60 80 100 120

1e+

011e

+03

1e+

051e

+07

Sample number

Sum

of p

arts

Emissions Products

Correlation matrix

Correlations for “Emission” computed “classically” from original data:

PC

B77

PC

B11

4

PC

B10

5

PC

B12

6

PC

B16

9

PC

B15

7

PC

B18

9

PC

B16

7

PC

B15

6

PC

B28

PC

B52

PC

B10

1

PC

B12

3

PC

B11

8

PC

B18

0

PC

B15

3

PC

B13

8PCB77

PCB114

PCB105

PCB126

PCB169

PCB157

PCB189

PCB167

PCB156

PCB28

PCB52

PCB101

PCB123

PCB118

PCB180

PCB153

PCB138

Original data (not normalized)

−1 0 1

Value

010

Color Keyand Histogram

Cou

nt

Correlation matrix

Correlations for “Emission” computed “classically” from “length 1 data”:

PC

B77

PC

B11

4

PC

B10

5

PC

B12

6

PC

B16

9

PC

B15

7

PC

B18

9

PC

B16

7

PC

B15

6

PC

B28

PC

B52

PC

B10

1

PC

B12

3

PC

B11

8

PC

B18

0

PC

B15

3

PC

B13

8PCB77

PCB114

PCB105

PCB126

PCB169

PCB157

PCB189

PCB167

PCB156

PCB28

PCB52

PCB101

PCB123

PCB118

PCB180

PCB153

PCB138

Data normed to row sum 1

−1 0 1

Value

010


Cou

nt

Correlation matrix

Correlations for “Emission” computed from clr coefficients:

PC

B77

PC

B11

4

PC

B10

5

PC

B12

6

PC

B16

9

PC

B15

7

PC

B18

9

PC

B16

7

PC

B15

6

PC

B28

PC

B52

PC

B10

1

PC

B12

3

PC

B11

8

PC

B18

0

PC

B15

3

PC

B13

8PCB77

PCB114

PCB105

PCB126

PCB169

PCB157

PCB189

PCB167

PCB156

PCB28

PCB52

PCB101

PCB123

PCB118

PCB180

PCB153

PCB138

clr coefficients

−1 0 1

Value

010


Cou

nt

Robust PCA

for data normed to row sum 1 (left) and CoDa PCA (right)

●

●

●●●

●

●

●

●

●

●

●

●●

●

●●

●

●●

●●

●●

●

●

●●

●

●●

●

●●

●●

●●

●

●

●

●●●

●●●

●

● ●

●

●

●

●

●

●

−2 0 2 4

−2

02

4

PC1

PC

2

−0.6 −0.2 0.0 0.2 0.4 0.6 0.8

−0.

6−

0.2

0.0

0.2

0.4

0.6

0.8

77126169

105

114

118

123 156157

167

189

28

52

101

138

153180

Data normed to row sum 1

●●

● ●●

●●

●

●●●●

●●●

●●

●●

●●●●

●●●

●

●

● ●●●

●

●

●

●●●

●●

●

●●

● ●●● ●

●●●

●● ●●

●

−10 −5 0 5 10

−10

−5

05

10

PC1 (clr)

PC

2 (c

lr)

−0.6 −0.2 0.0 0.2 0.4 0.6

−0.

6−

0.2

0.0

0.2

0.4

0.6

77

126

169

105114118123

156157167

189

28

52

101

138153180

CoDa PCA

Discriminate two groups

Two groups (red, green), normally distributed, equal covariance matrix:

x1 x2

x3

Original space

−2 0 2 4 6

−2

−1

01

2

z1

z 2

ilr space

Robust QDA (all POPs)

Robust QDA is applied to the ilr coordinates of the complete data set:Diagnostics based on robust distances: Mahalanobis distances for each group,using robust estimates of location and covariance from QDA.Can be used to assign (or not) new observations to a group.

0 20 40 60 80 100 120

05

1015

2025

30

Index

Rob

ust d

ista

nces

Emissions (correct)Emissions (wrong)Products (correct)Products (wrong)

Group profiles

Robust group centers can be back-transformed fromilr coordinates to the simplex.

PC

B77

PC

B12

6P

CB

169

PC

B10

5P

CB

114

PC

B11

8P

CB

123

PC

B15

6P

CB

157

PC

B16

7P

CB

189

PC

B28

PC

B52

PC

B10

1P

CB

138

PC

B15

3P

CB

180

0.00

0.05

0.10

0.15

0.20

Emissions

PC

B77

PC

B12

6P

CB

169

PC

B10

5P

CB

114

PC

B11

8P

CB

123

PC

B15

6P

CB

157

PC

B16

7P

CB

189

PC

B28

PC

B52

PC

B10

1P

CB

138

PC

B15

3P

CB

180

0.00

0.05

0.10

0.15

0.20

0.25

Products

Group profiles

Group centers obtained from median profiles,incorrectly computed in the simplex.

PC

B77

PC

B12

6P

CB

169

PC

B10

5P

CB

114

PC

B11

8P

CB

123

PC

B15

6P

CB

157

PC

B16

7P

CB

189

PC

B28

PC

B52

PC

B10

1P

CB

138

PC

B15

3P

CB

180

0.00

0.05

0.10

0.15

Emissions

PC

B77

PC

B12

6P

CB

169

PC

B10

5P

CB

114

PC

B11

8P

CB

123

PC

B15

6P

CB

157

PC

B16

7P

CB

189

PC

B28

PC

B52

PC

B10

1P

CB

138

PC

B15

3P

CB

180

0.00

0.05

0.10

0.15

Products

Benefit of log-ratios

So, what are ilr coordinates?They form an orthonormal basis, considering ALL pairwise log-ratios.They form an isometry, i.e.

Aitchison distances in the simplex = Euclidean distances in ilr.

log-ratios are scale invariant, i.e.

log(xixj

)= log

(const · xiconst · xj

)Thus, no matter if the data sum up to 1, andno matter if the data sum up to completely different concentration levels!

Why log-ratios?

• Compositions consist of relative contributions to a whole (mg/kg, %,proportions, etc.).This relative information is analyzed in form of log-ratios (log guaran-tees same variance if nominator and denominator change).

• Compositions live in the simplex sample space. They do not followthe usual Euclidean geometry, for which most statistical methods aredesigned.

• Work with ilr coordinates, and transform back for an interpretation, ifnecessary.

• An appropriate treatment does not necessarily lead to better results,but at least the interpretation of the results is correct.

Data quality issues

• Missing values can be imputed (with CoDa methods!), but they shouldoccur only rarely!

• Values BDL can be estimated (with CoDa methods!), but they shouldoccur only rarely, and the DL has to be known!

• Usually, multivariate methods require more samples than variables!

• If only few samples per group: covariance-based methods (discrimi-nant analysis) are problematic −→ switch to distance-based methods

Some references

J. Aitchison (1986). The statistical analysis of compositional data. Monographs onstatistics and applied probability. Chapman & Hall, London.

J.J. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueraz, C. Barcelo-Vidal (2003). Isomet-ric log-ratio transformations for compositional data analysis. Mathematical Geology,35(3):279-300.

P. Filzmoser, K. Hron, and M. Templ (2012). Discriminant analysis for compositional dataand robust parameter estimation. Computational Statistics, 27(4):585-604.

V. Pawlowsky-Glahn and A. Buccianti (2011). Compositional data analysis: Theory andapplications. Wiley, Chichester.

V. Pawlowsky-Glahn, J.J. Egozcue, and R. Tolosana-Delgado (2015). Modeling andanalysis of compositional data. Wiley, Chichester.

K.G. van den Boogaart and R. Tolosana-Delgado (2013). Analyzing compositional datawith R. Springer, Heidelberg.

Software R

For the computations, we used the R package “robCompositions”:

• Principal component analysis (here robust):

res <- pcaCoDa(x) # x is the original data matrix

biplot(res) # shows a biplot

• Linear discriminant analysis (LDA):

x.ilr <- isomLR(x) # express data x in ilr coordinates

library(rrcov)

res <- LdaClassic(x.ilr,grp) # grp contains grouping info

predict(res) # shows error rate and confusion table

Use “Linda” for robust LDA, “QdaClassic” for quadratic discriminantanalysis (QDA), and “QdaCov” for robust QDA.

methodological concepts for source apportionment · analyzing compositional data with r. springer,...

Documents