conclusions conclusions - missing values of the principal physico-chemical properties are predicted...
TRANSCRIPT
CONCLUSIONSCONCLUSIONS - Missing values of the principal physico-chemical properties are predicted by validated regression models by
using different kinds of molecular descriptors.
- Ranking of POPs according to their atmospheric mobility tendency is carried out by means of two multivariate
approaches:
a) Principal Component Analysis of predicted physico-chemical data for long range-transport potential;
b) Multicriteria Decision Making, taking also into account the atmospheric half-life and the sorption on the
atmospheric particles, for more general environmental behaviour.
- POP classifications in 4 mobility classes allow the direct and simple assessment of POP environmental
behaviour (for both existing and new chemicals) from the molecular structure only.
Paola Gramatica, Stefano Pozzi, Federica Consolaro and Roberto Todeschini§
QSAR Research Unit, Dep. of Structural and Functional Biology, University of Insubria, via Dunant 3, 21100 Varese (Italy) §Milano Chemometric and QSAR Res. Group, Dep. of Environmental Sciences, University of Milano-Bicocca, via Emanueli 15, 20123 Milano (Italy)
e-mail: [email protected] web-site: http://andromeda.varbio.unimi.it/~QSAR
INTRODUCTIONINTRODUCTION
Persistent Organic Pollutants (POPs), particularly PAH, PCB, polychlorinated dibenzo-p-dioxins and some pesticides, are
organic compounds that bioaccumulate and resist photolytic, chemical or biological degradation.
The discovery that these compounds can move for thousands of kilometres from the point of release, by the so-called
“grasshopper effect” (Figure 1), has shown their global distribution behaviour.
The POPs capable of exhibiting this long-range transport potential are nowadays the focus of various national and
international regulatory initiatives (UNECE, UNEP, etc.) and have been the object of the recent SETAC Pellston Workshop.
POP environmental behaviour is clearly controlled by a variety of physical and chemical processes, that can be analyzed
with the study of properties such as vapour pressure (vp), Henry’s law constant (H), various partition coefficients (Kow,
Koc,...), water solubility (S), atmospheric half-life, etc. Unfortunately, for a large number of POPs, the experimental data for
several properties remain unknown, thus regression models, according to the QSAR/QSPR strategies, need to be developed
to predict the missing values.
A rank of POPs, according to their atmospheric mobility is possible by multivariate approaches (PCA and Multicriteria
Decision Making) based on these predicted data.
A classification of POPs in mobility classes by few molecular descriptors of chemical structure will allow a fast screening
of existing and new compounds.
MOLECULAR DESCRIPTORSMOLECULAR DESCRIPTORS
The QSAR/QSPR approach is applied to predict the values of several physico-chemical properties (lacking for a large
number of POPs) by regression models using different kinds of molecular descriptors for the structural representation of
the studied compounds.
Molecular descriptors represent the way chemical information contained in the molecular structure is transformed and
coded. Among the theoretical descriptors, the best known, obtained simply from the knowledge of the formula, are:
molecular weight and count descriptors (1D-descriptors, i. e. counting of bonds, atoms of different kind, presence or
counting of functional groups and fragments, etc.). Graph-invariant descriptors (2D-descriptors, including both
topological and information indices), are obtained from the knowledge of the molecular topology. WHIM molecular
descriptors [1] contain information about the whole 3D-molecular structure in terms of size, symmetry and atom
distribution. All these indices are calculated [2] from the (x,y,z)-coordinates of a three-dimensional structure of a
molecule, usually from a spatial conformation of minimum energy: 37 non-directional (or global) and 66 directional WHIM
descriptors are obtained. A complete set of about two hundred molecular descriptors has been obtained.
Being our representation of a chemical based on a number of molecular descriptors, an effective variable selection
strategy GA-VSS (Genetic Algorithm - Variable Subset Selection) was applied to the whole set of descriptors in order to
set out the most relevant variables in modelling POP properties by Ordinary Least Squares regression (OLS), maximising
the predictive power (Q2 LOO) [3].
Models with good predictive performances (Q2 LOO = 78-96%) are obtained for all the physico-chemical properties and
the atmospheric half-life, thus achieving reliable data for 87 compounds.
[1] Todeschini R. and Gramatica P.; Quant.Struct.-Act.Relat. 1997, 16, 113-119
[2] Todeschini R. - WHIM-3D / QSAR - Software for the calculation of the WHIM descriptors. rel. 4.1 for Windows, Talete srl, Milan (Italy) 1996. Download: http://www.disat.unimi.it/chm.
[3] Todeschini R. - Moby Digs - Software for Variable Subset Selection by Genetic Algorithms. Rel. 1.0 for Windows, Talete srl, Milan (Italy) 1997.
PRINCIPAL COMPONENT ANALYSISPRINCIPAL COMPONENT ANALYSIS
The biplot of principal component analysis (Figure 2) for 87 POP (Tab. 1), described by the principal physico-chemical
properties (boiling point, melting point, logKow, logKoc, Henry’s law constant, TSA, Vmol, water solubility, vapour pressure)
and the atmospheric half-life, shows a distribution of compounds along the first component (PC1, EV = 70.4%) according to
the global mobility categories assigned by Wania and Mackay[4].
Consequently, it is possible to use the PC1 score values to classify all the 87 POPs in one of the four classes of mobility
(high, relatively high, relatively low and low mobility), defined by the marked cut-off.
This PCA model doesn’t include the atmospheric half-life, because this property is represented only in the second
component, thus it is a model capable to describe the relative long-range transport potential of POPs.
[4] Wania F. and Mackay D., Environ. Sci. Technol., Vol. 30, NO. 9, 1996
BIPLOT PC1-PC2 (CUM EV = 88.8 %, PC1 = 70.4%)
PC1
PC
2
1
2
345
6
789
10111213
14
15
16
17
18
19
20
21
2223
24
2526
27
28 29
30
313233
3435
36
37
38
3940
41
4243
44
4546
47
4849
50
51
5253
5455
56
5758
5960
61
62
63
6465
66
67
68
69
70
71
72
73
7475
76
77
78
79
80
81
8283
84
85
8687
-4
-3
-2
-1
0
1
2
3
4
-10 -8 -6 -4 -2 0 2 4 6 8
high mobilityrelat. highrelat. lowlowmobility unknown
I) High MobilityII) Relat. high MobilityIII) Relat. low mobilityIV) Low mobility
Half-life
Vp
Solub.
logKow
logKoc
mp
bp
Categories of
Wania and Mackay
Some of the variables represented in the first principal component
Not represented in the first PC
FIGURE 2
ID COMPOUND CLA CART KNN RDA ID COMPOUND CLA CART KNN RDA
1 indan 1 1 1 1 45 PCB-4 2 2 2 2
2 naphtalene 1 1 1 1 46 PCB-7 2 2 2 2
3 1-methylnaphthalene 1 1 1 2 47 PCB-9 2 2 2 2
4 2-methylnaphthalene 1 1 1 2 48 PCB-11 2 2 2 2
5 1.2-dimethylnaphthalene 2 2 2 2 49 PCB-12 2 2 2 2
6 1.3-dimethylnaphthalene 2 2 2 2 50 PCB-14 2 2 2 2
7 1.4-dimethylnaphthalene 2 2 2 2 51 PCB-15 2 2 2 2
8 1.5-dimethylnaphthalene 2 2 2 2 52 PCB-18 2 2 2 2
9 2.3-dimethylnaphthalene 2 2 2 2 53 PCB-26 2 2 2 2
10 2-6-dimethylnaphthalene 2 2 2 2 54 PCB-28 2 2 2 2
11 1-ethylnaphthalene 2 2 2 2 55 PCB-29 2 2 2 2
12 2-ethylnaphthalene 2 2 2 2 56 PCB-30 2 2 2 2
13 1.4.5-trimethylnaphthalene 2 2 2 2 57 PCB-33 2 2 2 2
14 acenaphthene 2 2 2 2 58 PCB-40 2 2 2 2
15 acenaphthylene 2 2 2 2 59 PCB-47 2 2 2 2
16 fluorene 2 2 2 2 60 PCB-52 2 2 2 2
17 1-methylfluorene 2 3 2 2 61 PCB-53 2 2 2 2
18 anthracene 3 2 2 2 62 PCB-54 2 2 2 2
19 2-methylanthracene 3 3 3 3 63 PCB-77 2 2 2 2
20 9-methylanthracene 3 3 3 3 64 PCB-86 2 3 2 2
21 9-10-dimethylanthracene 3 3 3 3 65 PCB-87 2 2 2 2
22 phenanthrene 2 3 2 2 66 PCB-101 2 2 2 2
23 1-methylphenanthrene 3 3 3 3 67 PCB-104 2 2 2 2
24 fluoranthene 3 3 3 3 68 PCB-128 2 3 3 2
25 1-2-benzofluorene 3 3 3 3 69 PCB-153 3 3 2 2
26 2-3-benzofluorene 3 3 3 3 70 PCB-155 3 3 3 2
27 pyrene 3 3 3 3 71 PCB-171 2 3 3 3
28 chrysene 3 3 3 4 72 PCB-194 3 3 3 3
29 triphenylene 3 3 3 4 73 PCB-202 2 3 3 3
30 naphthacene 4 3 3 3 74 PCB-206 3 3 3 3
31 benz[a]anthracene 3 3 3 4 75 PCB-209 3 2 3 3
32 benzo[b]fluoranthene 4 4 4 4 76 a -HCH 1 1 1 1
33 benzo[j]fluoranthene 4 4 4 4 77 y-HCH 1 1 1 1
34 benzo[a]pyrene 4 4 4 4 78 p,p'-DDT 3 3 2 3
35 benzo[e]pyrene 4 4 4 4 79 p,p'-DDE 3 3 3 3
36 perylene 4 4 4 4 80 p,p'-DDD 3 3 3 3
377.12-dimethylbenz[a]anthracene 4 4 4 4 81 chlordane 3 3 2 3
38 3-methylcholanthrene 4 4 4 4 82 dieldrin 3 3 2 3
39 benzo[g,h,i]perylene 4 4 4 4 83 2,3,7,8-tetraCl-dibenzo-p -dioxin 3 2 3 3
40 dibenz[a,h]anthracene 4 4 3 4 84 1,2,3,4,7,8-hexaCl-dibenzo-p -dioxin 3 3 3 3
41 biphenyl (PCB-0) 1 1 2 2 85 octaCl-dibenzo-p -dioxin 3 3 3 3
42 PCB-1 1 2 2 2 86 pentaCl-benzene 1 1 1 1
43 PCB-2 2 2 2 2 87 hexaCl-benzene 1 1 1 1
44 PCB-3 2 2 2 2
““DESIRABILITY” OF POPDESIRABILITY” OF POPs s ACCORDING TO THEIR ACCORDING TO THEIR ATMOSPHERIC MOBILITYATMOSPHERIC MOBILITY
The main goal of this work is a suggestion of POP ranking according to their atmospheric mobility.
In addition to the long-range transport potential, modelled by PC1, the compound half-life and their sorption on the
atmospheric particles must be considered because of their influence on the atmospheric mobility. A chemometric strategy
known as “Multicriteria Decision Making“, particularly the desirability functions, was used for these purposes.
Less mobile POPs are considered in this work as more desirable. Thus we apply the following criteria:• first principal component (PC1) score values as mobility index: optimum = low values • logKoc as atmospheric particle sorption index: optimum = high values• atmospheric half-life values: optimum = low values
The desirability values for each compounds were calculated by a linear function.
In figure 3 the desirability values were plotted as a function of the molecule ID; the compound mobility trend appears very
similar to the real world distribution.
TABLE 1
11
33 44
66
22
High mobilityHigh mobility
Relat. high mobilityRelat. high mobility
Relat. Low mobilityRelat. Low mobility
Low mobilityLow mobility
GRASSHOPPER EFFECTGRASSHOPPER EFFECT
DDTDDTHCBHCB
FIGURE 1
Selected molecular descriptors
C = count descriptorsT = topological descriptorsW-DIR = directional WHIM descriptorsW-ND = no directional WHIM descriptors
NAT= number of atoms (C)NBO = number of bonds (C) CHI0 = Randic chi-0 (T) CHI1A = Randic chi-1 (average) (T) GSI = Gordon-Scatlebury index (T)BAL = Balaban index (T) IAC = index of atomic composition (T)IDDE = total index on equivalence of degrees (T)DELS = total electrotopological difference (T)ROUV = Rouvray index (T)MAXDP = maximum electrotop. difference (T)MW = molecular weight NCl = number of Cl (C)L1m = dimension along the first component with atomic mass weight (W-DIR)L2s = size along the second component with electrotopological weight (W-DIR)E1u, E3u = density along respectively the first and the third dimension with unit weight (W-DIR)P2u = shape along the second component with unit weight (W-DIR)P1v = shape along the first component with van der Waals volume weight (W-DIR)Ts, Tm = size (eigenvalue sum) with respectively atomic mass and electrotopological weight (W-ND)Av = size (cross-term eigenvalue sum) with van der Waals volume weight (W-ND)Vu = size (complete eigenvalue expression) with unit weight (W-ND)
LDALDA MOLECULAR MOLECULAR DESCRIPTORS:DESCRIPTORS:
CHI0 IAC DELS ROUV CHI0 IAC DELS ROUV MAXDP MW NCl L1m E3u P1v MAXDP MW NCl L1m E3u P1v
L2s Ts Tm Vu AvL2s Ts Tm Vu Av
CART (162 DESCRIPTORS)CART (162 DESCRIPTORS)
MRMRcv cv = 12.64 %= 12.64 %
= 0.5 ; = 0.5 ; = 0.0 = 0.0
MRMRcvcv = 14.94 % = 14.94 %
K = 3K = 3
MRMRcvcv = 13.79 % = 13.79 %
RDARDA
KNNKNN
4234231231
assigned class
GSI 25.50
NBO 12.50
CHI1A 0.46
BAL 1.71
NAT 31.00
P2u 0.17
IDDE 2.84
BAL 1.49
E1u 0.55
Classification Tree
CLASSIFICATIONSCLASSIFICATIONS
A POP classification according to their environmental behaviour was made by means of several classification methods
(CART, K-NN and RDA). The a priori classes were obtained from the desirability values: high mobility (class 1) 0.33,
relatively high (cla 2) = [0.33- 0.5], relatively low (cla 3) = [0.5- 0.67], low mobility (cla 4) > 0.67.
All the classification methods give models with satisfactory prediction power (results below). The simplest model, and
consequently the most directly applicable, is developed with CART (figure 4): the selected descriptors are mainly related
to the molecular size.
In Tab. 1 are reported the a priori classes (cla) and the predicted classes by each model for all POPs: most of the
compounds have been assigned to the same class by all the applied classification methods, only compounds at the
border of two contiguous classes have a different classification. Nevertheless, it must be noted that no compounds have
been assigned to not-adjacent classes and the cut-off values are not strictly definable.
CLASSIFICATIONCLASSIFICATION
NOMMR = 63.22 %NOMMR = 63.22 %
FIGURE 4
55
ID
De
sir
ab
ilit
y
1
234
56789101112
13
1415
16
17
181920
21
22
2324
2526
27
2829
30
31
3233
3435
36
37
38
3940
4142
4344454647
484950
51
5253545556
575859
6061626364656667
68
697071
72
73
7475
7677
78
79
8081
8283
8485
8687
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
-1 19 39 59 79
Chemical classPAHPCBPesticideDioxinBenzene
FIGURE 3
High High mobilitymobility
Low Low mobilitymobility