methodological concepts for source apportionment · analyzing compositional data with r. springer,...
TRANSCRIPT
Methodological Conceptsfor Source Apportionment
Peter Filzmoser
Institute of Statistics and Mathematical Methods in EconomicsVienna University of Technology
UBA Berlin, Germany
November 18, 2016
in collaboration with
Example data: POPs
POPs: Persistent Organic Pollutants7 dioxine compounds (D1-D7), 10 fouran compounds (F1-F10)Two “similar” observations, with very different scale (concentration levels)!
D1
D2
D3
D4
D5
D6
D7
F1
F2
F3
F4
F5
F6
F7
F8
F9
F10
020
0040
0060
0080
0010
000
Domestic heating, obs. 4
D1
D2
D3
D4
D5
D6
D7
F1
F2
F3
F4
F5
F6
F7
F8
F9
F10
020
040
060
080
010
00
Domestic heating, obs. 8
Should we normalize to row sum 1, or analyze (log-)ratios?
Compositional data
Definition: Compositional data consist of vectors x = (x1, . . . , xD) withD strictly positive components describing the parts on a whole, and whichcarry only relative information (Aitchison, 1986; Egozcue, 2009).
Consequences:
• The values x1, . . . , xD as such are not informative, but only their ratiosare of interest.
• The parts x1, . . . , xD do not need to sum up to 1.
• Compositional data follow the so-called Aitchison geometry on thesimplex (and not the standard Euclidean geometry).
Key reference:
J. Aitchison. The Statistical Analysis of Compositional Data. Chapman andHall, London, U.K., 1986.
Example data: POPs
Two compositions with D parts:
x = (x1, x2, . . . , xD)
x̃ = (x̃1, x̃2, . . . , x̃D)
Aitchison distance between x and x̃ is defined as:
dA(x, x̃) =1
D
D−1∑i=1
D∑j=i+1
(log
xixj− log
x̃ix̃j
)2
For comparison: Euclidean distance:
dE(x, x̃) =
√√√√√ D∑i=1
(xi − x̃i)2
Example data: POPs
D1
D2
D3
D4
D5
D6
D7
F1
F2
F3
F4
F5
F6
F7
F8
F9
F10
020
0040
0060
0080
0010
000
Original data
D1
D2
D3
D4
D5
D6
D7
F1
F2
F3
F4
F5
F6
F7
F8
F9
F10
0.0
0.1
0.2
0.3
0.4
Normalized to row sum 1
Aitchison distance: 1.21 Aitchison distance: 1.21Euclidean distance: 10636 Euclidean distance: 0.075
Euclidean geometry
Represent information from the simplex in the Euclidean geometry:
• clr (centered log-ratio) coefficients:Divide values by the geometric mean:
y = (y1, . . . , yD) =
log x1D√∏D
i=1 xi, . . . , log
xDD√∏D
i=1 xi
• ilr (isometric log-ratio) coordinates:z = (z1, . . . , zD−1) can be defined by
zi =
√D − i
D − i+1log
xiD−i√∏D
j=i+1 xj, i = 1, . . . , D − 1.
clr & ilr
Both, clr and ilr are isometric, which means that
dA(x, x̃) = dE(clr(x), clr(x̃)) = dE(ilr(x), ilr(x̃))
where dE stands for Euclidean distance.
This means that after expressing the information in clr coefficients or ilrcoordinates, the data are in the usual Euclidean geometry. Most standardstatistical methods are designed for this geometry.
A log-transformation does not transfer the compositions to this Euclideangeometry!
Example data
Data set with compositional parts (dioxins and indicator PCBs)PCB77, PCB126, PCB169, PCB105, PCB114, PCB118, PCB123, PCB156,PCB157, PCB167, PCB189, PCB28, PCB52, PCB101, PCB138, PCB153,PCB180measured in 56 “emissions” (ng/m3) and 65 “products” (µg/g or ng/g).
●●
●●●
●
●●●●●
●●●
●●
●
●
●
●●●
●●●●
●
●●
●●
●
●
●
●
●●●
●●●●●
●●●●
●●●
●●●●
●
●
●
●●
●●
●
●
●●
●●●●●●●●●●●●
●
●
●●●●
●●●●
●●
●●
●
●
●
●
●
●●●
●
●
●
●●●●
●●●
●
●●●●●
●●●
●●●
0 20 40 60 80 100 120
1e+
011e
+03
1e+
051e
+07
Sample number
Sum
of p
arts
Emissions Products
Correlation matrix
Correlations for “Emission” computed “classically” from original data:
PC
B77
PC
B11
4
PC
B10
5
PC
B12
6
PC
B16
9
PC
B15
7
PC
B18
9
PC
B16
7
PC
B15
6
PC
B28
PC
B52
PC
B10
1
PC
B12
3
PC
B11
8
PC
B18
0
PC
B15
3
PC
B13
8PCB77
PCB114
PCB105
PCB126
PCB169
PCB157
PCB189
PCB167
PCB156
PCB28
PCB52
PCB101
PCB123
PCB118
PCB180
PCB153
PCB138
Original data (not normalized)
−1 0 1
Value
010
Color Keyand Histogram
Cou
nt
Correlation matrix
Correlations for “Emission” computed “classically” from “length 1 data”:
PC
B77
PC
B11
4
PC
B10
5
PC
B12
6
PC
B16
9
PC
B15
7
PC
B18
9
PC
B16
7
PC
B15
6
PC
B28
PC
B52
PC
B10
1
PC
B12
3
PC
B11
8
PC
B18
0
PC
B15
3
PC
B13
8PCB77
PCB114
PCB105
PCB126
PCB169
PCB157
PCB189
PCB167
PCB156
PCB28
PCB52
PCB101
PCB123
PCB118
PCB180
PCB153
PCB138
Data normed to row sum 1
−1 0 1
Value
010
Color Keyand Histogram
Cou
nt
Correlation matrix
Correlations for “Emission” computed from clr coefficients:
PC
B77
PC
B11
4
PC
B10
5
PC
B12
6
PC
B16
9
PC
B15
7
PC
B18
9
PC
B16
7
PC
B15
6
PC
B28
PC
B52
PC
B10
1
PC
B12
3
PC
B11
8
PC
B18
0
PC
B15
3
PC
B13
8PCB77
PCB114
PCB105
PCB126
PCB169
PCB157
PCB189
PCB167
PCB156
PCB28
PCB52
PCB101
PCB123
PCB118
PCB180
PCB153
PCB138
clr coefficients
−1 0 1
Value
010
Color Keyand Histogram
Cou
nt
Robust PCA
for data normed to row sum 1 (left) and CoDa PCA (right)
●
●
●●●
●
●
●
●
●
●
●
●●
●
●●
●
●●
●●
●●
●
●
●●
●
●●
●
●●
●●
●●
●
●
●
●●●
●●●
●
● ●
●
●
●
●
●
●
−2 0 2 4
−2
02
4
PC1
PC
2
−0.6 −0.2 0.0 0.2 0.4 0.6 0.8
−0.
6−
0.2
0.0
0.2
0.4
0.6
0.8
77126169
105
114
118
123 156157
167
189
28
52
101
138
153180
Data normed to row sum 1
●●
● ●●
●●
●
●●●●
●●●
●●
●●
●●●●
●●●
●
●
● ●●●
●
●
●
●●●
●●
●
●●
● ●●● ●
●●●
●● ●●
●
−10 −5 0 5 10
−10
−5
05
10
PC1 (clr)
PC
2 (c
lr)
−0.6 −0.2 0.0 0.2 0.4 0.6
−0.
6−
0.2
0.0
0.2
0.4
0.6
77
126
169
105114118123
156157167
189
28
52
101
138153180
CoDa PCA
Discriminate two groups
Two groups (red, green), normally distributed, equal covariance matrix:
x1 x2
x3
Original space
−2 0 2 4 6
−2
−1
01
2
z1
z 2
ilr space
Discriminate two groups
Two groups (red, green), normally distributed, equal covariance matrix:
x1 x2
x3
Original space
−2 0 2 4 6
−2
−1
01
2
z1
z 2
ilr space
Robust QDA (all POPs)
Robust QDA is applied to the ilr coordinates of the complete data set:Diagnostics based on robust distances: Mahalanobis distances for each group,using robust estimates of location and covariance from QDA.Can be used to assign (or not) new observations to a group.
0 20 40 60 80 100 120
05
1015
2025
30
Index
Rob
ust d
ista
nces
Emissions (correct)Emissions (wrong)Products (correct)Products (wrong)
Group profiles
Robust group centers can be back-transformed fromilr coordinates to the simplex.
PC
B77
PC
B12
6P
CB
169
PC
B10
5P
CB
114
PC
B11
8P
CB
123
PC
B15
6P
CB
157
PC
B16
7P
CB
189
PC
B28
PC
B52
PC
B10
1P
CB
138
PC
B15
3P
CB
180
0.00
0.05
0.10
0.15
0.20
Emissions
PC
B77
PC
B12
6P
CB
169
PC
B10
5P
CB
114
PC
B11
8P
CB
123
PC
B15
6P
CB
157
PC
B16
7P
CB
189
PC
B28
PC
B52
PC
B10
1P
CB
138
PC
B15
3P
CB
180
0.00
0.05
0.10
0.15
0.20
0.25
Products
Group profiles
Group centers obtained from median profiles,incorrectly computed in the simplex.
PC
B77
PC
B12
6P
CB
169
PC
B10
5P
CB
114
PC
B11
8P
CB
123
PC
B15
6P
CB
157
PC
B16
7P
CB
189
PC
B28
PC
B52
PC
B10
1P
CB
138
PC
B15
3P
CB
180
0.00
0.05
0.10
0.15
Emissions
PC
B77
PC
B12
6P
CB
169
PC
B10
5P
CB
114
PC
B11
8P
CB
123
PC
B15
6P
CB
157
PC
B16
7P
CB
189
PC
B28
PC
B52
PC
B10
1P
CB
138
PC
B15
3P
CB
180
0.00
0.05
0.10
0.15
Products
Benefit of log-ratios
So, what are ilr coordinates?They form an orthonormal basis, considering ALL pairwise log-ratios.They form an isometry, i.e.
Aitchison distances in the simplex = Euclidean distances in ilr.
log-ratios are scale invariant, i.e.
log(xixj
)= log
(const · xiconst · xj
)Thus, no matter if the data sum up to 1, andno matter if the data sum up to completely different concentration levels!
Why log-ratios?
• Compositions consist of relative contributions to a whole (mg/kg, %,proportions, etc.).This relative information is analyzed in form of log-ratios (log guaran-tees same variance if nominator and denominator change).
• Compositions live in the simplex sample space. They do not followthe usual Euclidean geometry, for which most statistical methods aredesigned.
• Work with ilr coordinates, and transform back for an interpretation, ifnecessary.
• An appropriate treatment does not necessarily lead to better results,but at least the interpretation of the results is correct.
Data quality issues
• Missing values can be imputed (with CoDa methods!), but they shouldoccur only rarely!
• Values BDL can be estimated (with CoDa methods!), but they shouldoccur only rarely, and the DL has to be known!
• Usually, multivariate methods require more samples than variables!
• If only few samples per group: covariance-based methods (discrimi-nant analysis) are problematic −→ switch to distance-based methods
Some references
J. Aitchison (1986). The statistical analysis of compositional data. Monographs onstatistics and applied probability. Chapman & Hall, London.
J.J. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueraz, C. Barcelo-Vidal (2003). Isomet-ric log-ratio transformations for compositional data analysis. Mathematical Geology,35(3):279-300.
P. Filzmoser, K. Hron, and M. Templ (2012). Discriminant analysis for compositional dataand robust parameter estimation. Computational Statistics, 27(4):585-604.
V. Pawlowsky-Glahn and A. Buccianti (2011). Compositional data analysis: Theory andapplications. Wiley, Chichester.
V. Pawlowsky-Glahn, J.J. Egozcue, and R. Tolosana-Delgado (2015). Modeling andanalysis of compositional data. Wiley, Chichester.
K.G. van den Boogaart and R. Tolosana-Delgado (2013). Analyzing compositional datawith R. Springer, Heidelberg.
Software R
For the computations, we used the R package “robCompositions”:
• Principal component analysis (here robust):
res <- pcaCoDa(x) # x is the original data matrix
biplot(res) # shows a biplot
• Linear discriminant analysis (LDA):
x.ilr <- isomLR(x) # express data x in ilr coordinates
library(rrcov)
res <- LdaClassic(x.ilr,grp) # grp contains grouping info
predict(res) # shows error rate and confusion table
Use “Linda” for robust LDA, “QdaClassic” for quadratic discriminantanalysis (QDA), and “QdaCov” for robust QDA.