Selection of data sets for QSARs: Analyses of Tetrahymena toxicity from aromatic compounds
This article was downloaded by: [University of California Santa Cruz]
On: 24 November 2014, At: 15:56
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
SAR and QSAR in Environmental Research
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/gsar20
Selection of data sets for QSARs: Analyses of Tetrahymena toxicity from aromatic compounds
T.W. Schultz a, T.I. Netzeva b & M.T.D. Cronin b
a College of Veterinary Medicine, The University of Tennessee, 2407 River Drive, Knoxville, TN, 379961-4500, USA
b School of Pharmacy and Chemistry, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
Published online: 29 Oct 2010.
To cite this article: T.W. Schultz, T.I. Netzeva & M.T.D. Cronin (2003) Selection of data sets for QSARs: Analyses of Tetrahymena toxicity from aromatic compounds, SAR and QSAR in Environmental Research, 14:1, 59-81, DOI: 10.1080/1062936021000058782
To link to this article: http://dx.doi.org/10.1080/1062936021000058782
Invited Paper
SELECTION OF DATA SETS FOR QSARs: ANALYSES OF TETRAHYMENA TOXICITY FROM AROMATIC COMPOUNDS
T.W. SCHULTZ a,*, T.I. NETZEVA b and M.T.D. CRONIN b
a The University of Tennessee, College of Veterinary Medicine, 2407 River Drive, Knoxville, TN 379961-4500 USA; b School of Pharmacy and Chemistry, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
(Received 19 September 2002; In final form 5 October 2002)
The aim of this investigation was to develop a strategy for the formulation of a valid ecotoxicological-based QSAR while, at the same time, minimizing the required number of toxicological data points. Two chemical selection approaches, distance-based optimality and K Nearest Neighbor (KNN), were used to examine the impact of the number of compounds used in the training and testing phases of QSAR development (i.e. diversity and representivity, respectively) on the predictivity (i.e. external validation) of the QSAR. Regression-based QSARs for the ecotoxic potency for population growth impairment of aromatic compounds (benzenes) to the aquatic ciliate Tetrahymena pyriformis were developed based on descriptors for chemical hydrophobicity and electrophilicity. A ratio of one compound in the training set to three in the test set was applied. The results indicate that from a known chemical universe, in this case 385 derivatives, robust QSARs of equal quality may be developed from a small number of diverse compounds, validated by a representative test set. As a conservative recommendation it is suggested that there should be a minimum of 10 observations for each variable in a QSAR.
Keywords: Experimental design; Validation; Response-surface QSAR; Tetrahymena toxicity
INTRODUCTION
There is increasing interest in the use of toxicological-based quantitative structure–activity
relationships (QSARs) as non-animal methods to provide data for priority setting, risk
assessment and chemical classification and labeling [1]. These uses will require development
of new QSARs and validation of the new, as well as existing, QSARs. Both development and
validation efforts may require additional toxicological testing; by conducting preliminary
evaluations prior to the chemicals’ selection process, it may be possible to maximize the
usefulness of the resulting QSAR while, at the same time, minimizing the number of toxicity
tests.
Ecotoxicity QSARs, such as ones developed for aquatic endpoints, are potency-based [2].
They use log-based continuous toxicological data (e.g. LC50 or EC50) and molecular
ISSN 1062-936X print/ISSN 1029-046X online q 2003 Taylor & Francis Ltd
DOI: 10.1080/1062936021000058782
*Corresponding author. Tel.: þ1-865-974-5826. Fax: þ1-865-974-2215. E-mail: [email protected]
SAR and QSAR in Environmental Research, 2003 Vol. 14 (1), pp. 59–81
descriptor data (e.g. physicochemical and quantum chemical parameters). These data are
linked together by a statistical method such as regression analysis [3]. Recent work [4–7] has
shown that, as a general tenet, the toxic potency of aromatic molecules for aquatic toxicity
endpoints can be modeled by two factors, hydrophobicity and stereo-electronic effects (i.e.
reactivity, steric hindrance), in fundamental and empirical regression models. The generic
model:
$$\log(C^{-1}) = a\,(\text{penetration}) + b\,(\text{interaction}) + c \qquad (1)$$
describes such QSARs, the objective of which is to predict the toxic potency of untested
compounds as accurately as possible. The ability to meet this objective is, in large part, a function
of the data sets on which the QSAR is trained and tested (i.e. validated).
The number and structural heterogeneity of compounds used to train and validate a QSAR
define the domain of that model and thus impact on its applicability. Since the use of a
particular QSAR is only valid within its particular domain [8], defining the depth and breadth
of that domain is of great importance. To enhance the applicability of a QSAR, efforts should be
made to optimize the molecular heterogeneity of the tested chemicals and thus maximize the
descriptor domain. However, due to time and resource limitations, the generation of bigger
and more varied data sets may not be the most efficient approach. At the same time, subtle
alterations in molecular structure can lead to dramatic changes in the mechanism of toxic
action and in potency. One of the best examples of this is the case of hydroxylated aromatic
compounds such as catechol and hydroquinone, whose toxicity is not well modeled by
general QSARs for hydroxylated aromatics [9]. In each case toxicity is underestimated due to
their propensity to tautomerize to a reactive semiquinone moiety [9]. Therefore, testing a
large number of chemicals exhibiting a gradual change in structure makes it less likely that
an unforeseen misapplication of the QSAR will occur. The net result is that, throughout the
development and validation of QSARs, one is forced to balance the inherent benefit a larger
database provides against the cost and time required to generate the extra data.
While it is advisable to have as many observations (i.e. chemicals) in a QSAR as possible,
certain statistical criteria must be met. The so-called rule of Topliss and Costello [10] states
that a minimum of 5 chemicals are required for the inclusion of each descriptor in a QSAR.
Therefore, for the generic QSAR in Eq. (1), which has two descriptors, the training set requires, at minimum, 10
compounds.
Validating a QSAR with external data (i.e. the prediction of the toxicity of compounds that
have not been included in the initial model), while the most demanding, is the best method of
validation. However, there is no consensus on a ratio between training and testing chemicals.
For databases where experimental design has not been applied, the number of chemicals in
the test set varies from a small number of available compounds [5,11], to one third [12] or
one half [13] of the whole database. In the selection and structure of the training set optimal
experimental design techniques undoubtedly offer further reduction of the numbers of
compounds required [14]. At the less conservative end of the spectrum, Brown and Martin
[15] suggested a ratio of 20% of compounds for training of a model and 80% to test it; Matter
[16] suggested 35% for training and 65% for testing. These ratios imply that a 1:3 training to
testing ratio is an adequate and quite acceptable starting point for investigations in this area.
Therefore, if one were to develop a QSAR using a training set of 10 compounds, it would
require data on an additional 30 chemicals for the validation exercise. The issue then
becomes how one selects these 40 chemicals.
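The arithmetic behind the 40-chemical figure can be written out as a short sketch. It assumes only what the text states: five chemicals per descriptor (the Topliss and Costello rule) and a 1:3 training to testing ratio. The function name and its parameters are illustrative and do not come from the original study.

```python
def dataset_sizes(n_descriptors, per_descriptor=5, test_ratio=3):
    """Return (training, test, total) set sizes.

    per_descriptor: minimum chemicals per descriptor (Topliss and Costello rule).
    test_ratio: test-set size as a multiple of the training-set size (1:3 here).
    """
    n_train = per_descriptor * n_descriptors
    n_test = test_ratio * n_train
    return n_train, n_test, n_train + n_test


# Two descriptors (penetration and interaction terms of Eq. (1)).
print(dataset_sizes(2))  # (10, 30, 40)
```

Raising `per_descriptor` to 10, the more conservative figure recommended in the abstract, gives 20 training and 60 test chemicals for the same two-descriptor model.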
For the development and validation of QSARs, regardless of the selection process used,
efforts should be made to ensure the chemicals in the validation set are similar to those of
the training set. Further, it must be ensured that the training set is diverse. While there are
a number of methods to select chemicals for training, and the subsequent validation of
a QSAR, two of the more obvious techniques are to (1) have the tested chemicals represent
the breadth, or variety, of all existing chemicals within that domain (i.e. diversity); and (2)
have the tested chemicals represent the depth, or distribution, of all existing chemicals within
that domain (i.e. representivity). In the former case the tested chemicals would be selected to
maximize the coverage of the descriptor space with as few chemicals as possible. In the
latter case the chemicals would be selected to mimic the distribution of the existing
chemicals within the descriptor space. These two processes may be considered as being
dissimilarity- and similarity-based selection, respectively.
While the above example leads to a minimum requirement of data for 40 chemicals, it is
not to be implied that testing 40 chemicals would be sufficient to cover the (toxicological)
domain encompassed by all aromatic derivatives. Accepting the hypothesis that increasing
the number of observations in a data set (n) will not increase the information it contains
ad infinitum, it is, therefore, important to ascertain if a significantly larger data set, for
example, 200 chemicals (i.e. 50 in the training set: 150 in the test set) would provide more
information, and add further value, than a set of 60 chemicals (i.e. 15 training set: 45 test set).
The purpose of the present investigation was to develop a strategy for the formulation of a
valid ecotoxicological-based QSAR while, at the same time, minimizing the required number
of toxicological data points. Two chemical selection approaches, distance-based optimality
and K nearest neighbor (KNN), were used to examine the impact of the number of
compounds used in the training and testing phases of QSAR development (i.e. diversity and
representivity, respectively) on the predictivity (i.e. external validation) of the QSAR. To this
end, regression-based QSARs for the ecotoxic potency for population growth impairment to
the aquatic ciliate Tetrahymena pyriformis were developed according to the approach of
Schultz [5] (i.e. development of models based on descriptors for chemical hydrophobicity
and electrophilicity). The database for inhibition of growth of the ciliated protozoan
T. pyriformis [17] is considered to contain high-quality data [18]. The database has been compiled for
QSAR development and validation and includes a wide variety of substituted benzenes.
MATERIALS AND METHODS
Chemicals Tested
Data for 385 commercially obtainable (Aldrich Chemical Co., Milwaukee, Wisconsin, USA;
MTM Research Chemicals or Lancaster Synthesis Inc., Windham, New Hampshire, USA)
substituted benzenes with purity ≥ 95% and representing a variety of classes and
mechanisms of toxic action were included in the chemical selection analyses. Derivatives
were confined to a defined and chemically heterogeneous domain. Specific substructures not
included in these evaluations were carboxylic acids, compounds having the ability to
tautomerize (e.g. catechols) and benzoquinones.
Toxicological Assessment
Toxicity data ($\log(\mathrm{IGC}_{50}^{-1})$) were determined in the population growth impairment assay
utilizing T. pyriformis (strain GL-C). Assays were conducted following the protocol
described by Schultz [17] with a 40-h static design and population density measured
spectrophotometrically as the endpoint. The test protocol allows for 8–9 cell cycles in
controls. Following range finding, each chemical was tested in three replicate evaluations.
Two controls were used to provide a measure of the acceptability of the test by indicating the
suitability of the medium and test conditions as well as a basis for interpreting data from
other treatment regimes. The first control had no test benzene but was inoculated with
T. pyriformis. The other, a blank, had neither test substance nor inoculum. Each test replicate
consisted of 6–10 different concentrations of each test material with duplicate flasks of each
concentration. Only replicates with control absorbance values >0.60 but <0.90 were used
in the analyses. The 50% growth inhibitory concentration (IGC50) was determined by the Probit
Analysis procedure of the Statistical Analysis System (SAS) software [19] for each benzene evaluated, with
the Y-values being absorbances normalized as a percentage of control and the X-values being
the toxicant concentrations in mg/l.
Chemical Descriptors
Hydrophobicity was quantified by the logarithm of the 1-octanol/water partition coefficient
(log Kow). The hydrophobicity values were measured or estimated by the ClogP for
Windows version 1.0.0 software (BIOBYTE Corp., Claremont, CA, USA). The acceptor
superdelocalizabilities were determined as the sum, over the unoccupied molecular orbitals,
of the ratios between the squared eigenvectors (coefficients) of the i-th atomic orbital in the
j-th unoccupied molecular orbital and the eigenvalue (energy) of the j-th unoccupied
molecular orbital, multiplied by two. The
calculations were performed using the AM1 method implemented in MOPAC 93. The maximum
acceptor superdelocalizabilities (Amax) were extracted via in-house macros in MS Word and
Excel.
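The verbal definition of the acceptor superdelocalizability can be illustrated with a minimal sketch. It assumes the orbital coefficients and unoccupied-orbital energies are already available as arrays; the function name, the array layout and the toy numbers are hypothetical, and the sign and unit conventions of a particular quantum chemistry package may differ. How atomic orbitals are grouped by atom before the maximum is taken is not described in the text, so the sketch simply takes the maximum over orbitals.

```python
import numpy as np


def acceptor_superdelocalizability(coeffs, energies):
    """Two times the sum over unoccupied MOs j of c_ij**2 / e_j, per atomic orbital i.

    coeffs:   array of shape (n_atomic_orbitals, n_unoccupied_mos)
    energies: array of shape (n_unoccupied_mos,), eigenvalues of the unoccupied MOs
    """
    coeffs = np.asarray(coeffs, dtype=float)
    energies = np.asarray(energies, dtype=float)
    return 2.0 * np.sum(coeffs ** 2 / energies, axis=1)


# Toy input (not real AM1 output): two atomic orbitals, two unoccupied MOs.
values = acceptor_superdelocalizability([[0.6, 0.2], [0.3, 0.5]], [1.0, 2.0])
a_max = values.max()  # maximum over the orbitals considered
print(values, a_max)
```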
Statistical Analyses
The number of observations in the training sets for QSAR development was varied from 10 to
50 in increments of 5. Compounds were selected for inclusion in the training set by a
dissimilarity-based procedure, distance-based optimality. Selection of compounds using this
criterion aims to spread the design points (i.e. the chemicals for the training set) uniformly
over the design space (i.e. the chemical universe) without the development of a model in
advance. Hence, diversity is maximized by this approach. The distance-based optimality
algorithm was implemented in the MINITAB (ver. 13.1) software as one of the procedures
for design of optimal experiments (DOE).
For dissimilarity-based selection of optimal subsets, a common algorithm
introduced by Kennard and Stone [20] was used. Following the lead of Snarey et al. [21], it
included: (1) initialization of the subset by transferring a compound from the database; (2)
calculation of the dissimilarity between each of the remaining compounds in the database and
the compounds in the subset; (3) selection of the compound from the database that is most
dissimilar to the subset and its transfer to the subset; and (4) return to Step 2 if there are fewer
than a specified number of compounds in the subset.
The distance-based optimality algorithm selected design points (i.e. chemicals) from a
candidate set, such that the points were spread evenly over the hydrophobicity/electrophi-
licity plane (log Kow and Amax were standardized before analysis in the range 0–1). The
optimization started in each case using all the 385 compounds in the database. The initial
compound (Step 1 above) was recognized as the compound with largest Euclidean distance
from the origin. Then, in Step 3, additional design points were added in a stepwise manner
such that each new point was as far as possible from the points already selected. The
selection of the training set in this manner ensured that the validation exercise (see below)
was performed as a process of interpolation and not extrapolation.
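The four steps above, with the starting rule and the 0–1 standardization described in this section, can be sketched as follows. This is not the MINITAB implementation. It assumes that "as far as possible from the points already selected" means maximizing the distance to the nearest selected point (a maximin rule), and the function names are illustrative.

```python
import numpy as np


def min_max_scale(x):
    """Standardize each descriptor column to the range 0-1 (columns must not be constant)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)


def select_diverse(points, n_select):
    """Return indices of n_select points spread over the descriptor plane."""
    pts = min_max_scale(points)
    # Step 1: start with the compound farthest from the origin.
    first = int(np.argmax(np.linalg.norm(pts, axis=1)))
    selected = [first]
    # Step 2: distance from every compound to its nearest selected compound.
    min_dist = np.linalg.norm(pts - pts[first], axis=1)
    while len(selected) < n_select:  # Step 4: repeat until the subset is full.
        # Step 3: take the compound most dissimilar to the current subset.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return selected


# Toy (log Kow, Amax)-like plane: four corners and a centre point.
print(select_diverse([[0, 0], [1, 1], [0, 1], [1, 0], [0.5, 0.5]], 4))  # [1, 0, 2, 3]
```

In the toy example the four corner points are chosen before the centre point, which is the behaviour expected of a diversity-driven design.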
The number of observations in the validation set for each of the
models was three times the number of observations in the corresponding training set (i.e. varied
by increments of 15 from 30 to 150). The compounds in the validation set were selected by
the K-nearest neighbor (KNN) technique in the same hydrophobicity/electrophilicity plane
(standardized in the range 0–1). K-means clustering classifies observations into groups when
the groups are initially unknown; here an initial partition column (the compounds
selected by DOE) was used to seed the clusters. The clusters were formed around the pre-selected compounds, each
pre-selected compound being typical for the respective cluster. Thus, the test set was optimized to
mimic the training set. The non-hierarchical KNN clustering technique (according to
Ref. [22]) was implemented in the MINITAB (ver. 13.1) software.
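A simplified illustration of a representative test set follows. It is not the MINITAB clustering procedure used in the study: it assumes, purely for illustration, that each training compound acts as a seed and that its three nearest unused neighbours in the standardized descriptor plane are taken as test compounds. The function name and the greedy ordering are assumptions.

```python
import numpy as np


def select_representative(points, train_idx, per_seed=3):
    """Pick per_seed nearest unused compounds around each training compound."""
    pts = np.asarray(points, dtype=float)  # assumed already scaled to 0-1
    train = set(train_idx)
    available = [i for i in range(len(pts)) if i not in train]
    test_idx = []
    for seed in train_idx:
        # Distances from the seed to every compound not yet assigned.
        d = np.linalg.norm(pts[available] - pts[seed], axis=1)
        nearest = np.argsort(d, kind="stable")[:per_seed]
        chosen = [available[k] for k in nearest]
        test_idx.extend(chosen)
        available = [i for i in available if i not in chosen]
    return test_idx


plane = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
         [1, 1], [0.9, 1], [1, 0.9], [0.9, 0.9], [0.5, 0.5]]
print(sorted(select_representative(plane, train_idx=[0, 4])))  # [1, 2, 3, 5, 6, 7]
```

Because every test compound is chosen next to a training compound, predictions for the test set are interpolations within the training domain, which is the property the text relies on.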
QSARs were developed using the regression procedures of MINITAB version 13.0.
Regression through the origin (as part of the assessment of the predictivity of the test set) was
performed using the SPSS (version 10.0.5) software. In this case, $r^2$ measures the proportion
of the variability in the dependent variable about the origin explained by regression and
should not be compared to $r^2$ for models that include an intercept. The $\log(\mathrm{IGC}_{50}^{-1})$ values
reported as mM were used as the dependent variable. For the QSARs, log Kow and Amax
acted as the independent variables. The fit of the resulting models was measured by the
coefficient of determination adjusted for the degrees of freedom ($r^2_{(adj.)}$). The uncertainty in the model was
noted as the square root of the mean square for error (s), while the predictivity of the model was
noted as the cross-validated $r^2$ ($r^2_{CV}$) determined by the leave-one-out method. Outliers were
identified as benzenes with a standardized residual >3 [23].
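The leave-one-out statistic can be sketched in a few lines. The sketch assumes the common definition of the cross-validated coefficient as one minus PRESS divided by the total sum of squares about the mean; the source does not spell out the formula used by MINITAB, and the function names are illustrative.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y = X @ b + intercept (intercept is the last entry)."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def loo_r2(X, y):
    """Leave-one-out cross-validated r2: 1 - PRESS / SS_total."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i            # drop compound i
        coef = fit_ols(X[keep], y[keep])         # refit without it
        pred = np.append(X[i], 1.0) @ coef       # predict the left-out compound
        press += (y[i] - pred) ** 2
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total


# Noise-free toy data with two descriptors: the statistic is 1 up to rounding error.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]], dtype=float)
y = 0.5 * X[:, 0] + 2.0 * X[:, 1] - 1.0
print(loo_r2(X, y))
```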
RESULTS
The data for hydrophobicity (log Kow) and electrophilicity (Amax) along with toxicity values
($\log(\mathrm{IGC}_{50}^{-1})$) are reported in Table I. Hydrophobicity varied over about six orders of
magnitude (from -0.55 to 5.76 on a log scale). Reactivity measured by Amax varied on a
linear scale from 0.280 to 0.385. Toxicity values varied uniformly over about four orders of
magnitude (from -1.13 to 2.82 on a log scale). Least-squares regression analysis of these data, based on
the two physico-chemical descriptors, yielded the equation (the standard errors of the
coefficients are in parentheses):
$$\log(\mathrm{IGC}_{50}^{-1}) = 0.545(0.015)\,(\log K_{ow}) + 16.21(0.62)\,(A_{max}) - 5.91(0.20)$$
$$n = 385;\ r^2_{(adj)} = 0.859;\ r^2_{CV} = 0.856;\ s = 0.274;\ F = 1167;\ \Pr > F = 0.0001 \qquad (2)$$
The QSAR in Eq. (2) was considered as the reference model since it contained all the
toxicological information of the database. A plot of observed toxicity versus that predicted
by Eq. (2) is presented in Fig. 1.
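As a worked example of Eq. (2), the sketch below evaluates the reference model for phenol, using the descriptor values listed for phenol in Table I (log Kow = 1.50, Amax = 0.301). The function name is illustrative; the coefficients are those of Eq. (2).

```python
def predict_log_inv_igc50(log_kow, a_max):
    """Reference QSAR of Eq. (2): predicted log(IGC50^-1)."""
    return 0.545 * log_kow + 16.21 * a_max - 5.91


phenol = predict_log_inv_igc50(1.50, 0.301)
print(round(phenol, 2))  # -0.21
```

The observed value for phenol in Table I is -0.35, so the residual (observed minus predicted) is about -0.14, well inside the standard error of the model (s = 0.274).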
A plot of the hydrophobicity (log Kow) and electrophilicity (Amax) for the complete set of
benzene derivatives for which toxicological data are available in this study is presented in
Fig. 2. From the complete database, training sets varying in number from 10 to 50 derivatives,
and test sets of three times the respective number, were selected. To illustrate the selection
procedure, the chemicals selected for the training set of 15 compounds, and the test set of 45, are
highlighted in Fig. 2.
Table II reports the diversity/representivity criteria for each of the training sets,
selected by DOE, and test sets, selected by KNN, respectively. These results indicate that
all test sets were within the physico-chemical space of the respective training set.
The QSARs and summary statistics developed from each of the 9 training sets of varying
number are reported in Table III. In each case, the values of the coefficients on the descriptors
and intercept are similar to those reported in Eq. (2). For the respective test, or validation,
sets, the relationship between toxicity predicted by the relevant QSAR and that observed is
TA
BL
EI
Nam
e,C
AS
nu
mb
er,
toxic
ity
and
ph
ysi
co-c
hem
ical
des
crip
tors
of
the
com
po
und
sin
the
dat
ase
t
No
.C
AS
Nam
elo
gðI
GC2
15
0Þ
log
Kow
Am
ax
Eq
ua
tio
ns *
11
87
7-7
7-6
3-A
min
ob
enzy
lal
coh
ol
21
.13
20
.55
0.2
93
(3)
25
34
4-9
0-1
2-A
min
ob
enzy
lal
coh
ol
21
.07
20
.17
0.2
96
31
00
-51
-6B
enzy
lal
coho
l2
0.8
31
.05
0.2
85
(5)
45
01
-94
-04
-Hy
dro
xy
ph
enet
hy
lal
coho
l2
0.8
30
.52
0.3
03
53
54
4-2
5-0
4-A
min
ob
enzy
lcy
anid
e2
0.7
60
.34
0.3
02
66
10
-15
-12
-Nit
rob
enza
mid
e2
0.7
22
0.1
20
.332
(3)
74
98
-00
-04
-Hy
dro
xy
-3-m
eth
ox
yb
enzy
lal
coho
l2
0.7
00
.29
0.3
00
(7)
89
0-0
4-0
2-M
eth
ox
yan
ilin
e2
0.6
91
.18
0.2
95
99
8-8
5-1
(sec
)-P
hen
eth
yl
alco
ho
l2
0.6
61
.42
0.2
85
10
10
8-4
6-3
1,3
-Dih
yd
rox
yb
enze
ne
20
.65
0.8
00
.307
11
14
89
8-8
7-4
1-P
hen
yl-
2-p
rop
anol
20
.62
1.9
70
.283
12
60
-12-8
Ph
enet
hy
lal
coho
l2
0.5
91
.36
0.2
86
13
61
7-9
4-7
2-P
hen
yl-
2-p
rop
anol
20
.57
1.8
10
.288
14
53
22
2-9
2-7
3-A
min
o-2
-cre
sol
20
.55
0.7
00
.301
15
90
-72-2
2,4
,6-T
ris-
(Dim
eth
yla
min
om
eth
yl)
ph
eno
l2
0.5
20
.92
0.3
02
16
58
9-1
8-4
4-M
eth
ylb
enzy
lal
coho
l2
0.4
91
.58
0.2
84
17
93
7-3
9-3
Ph
eny
lace
tic
acid
hy
dra
zid
e2
0.4
80
.14
0.3
19
18
22
37
-30
-13
-Cy
ano
anil
ine
20
.47
1.0
70
.306
(3)
19
98
-86-2
Ace
toph
eno
ne
20
.46
1.6
30
.318
20
89
-95-2
2-M
eth
ylb
enzy
lal
coho
l2
0.4
31
.55
0.2
85
21
93
-54-9
(^)1
-Ph
eny
l-1
-pro
pan
ol
20
.43
1.9
40
.286
22
87
-59-2
2,3
-Dim
eth
yla
nil
ine
20
.43
1.8
10
.294
23
87
-62-7
2,6
-Dim
eth
yla
nil
ine
20
.43
1.8
40
.294
24
10
0-8
6-7
2-M
eth
yl-
1-p
hen
yl-
2-p
rop
anol
20
.41
1.8
60
.285
25
58
9-0
8-2
N-M
eth
ylp
hen
eth
yla
min
e2
0.4
11
.43
0.2
85
26
11
23
-85
-92
-Ph
eny
l-1
-pro
pan
ol
20
.40
1.5
80
.286
27
45
6-4
7-3
3-F
luo
rob
enzy
lal
coh
ol
20
.39
1.2
50
.305
28
14
19
1-9
5-8
4-H
yd
roxy
ben
zyl
cyan
ide
20
.38
0.9
00
.309
29
30
34
-34
-24
-Cy
ano
ben
zam
ide
20
.38
0.4
80
.322
(7)
30
34
8-5
4-9
2-F
luo
roan
ilin
e2
0.3
71
.26
0.3
02
31
10
8-6
9-0
3,5
-Dim
eth
yla
nil
ine
20
.36
1.9
10
.293
32
14
0-2
9-4
Ben
zyl
cyan
ide
20
.36
1.5
60
.294
33
10
8-9
5-2
Ph
eno
l2
0.3
51
.50
0.3
01
34
15
0-1
9-6
3-M
eth
ox
yp
hen
ol
20
.33
1.5
80
.305
35
95
-78-3
2,5
-Dim
eth
yla
nil
ine
20
.33
1.8
30
.294
36
95
-48-7
2-M
eth
ylp
hen
ol
20
.29
1.9
80
.301
37
95
-68-1
2,4
-Mim
eth
yla
nil
ine
20
.29
1.6
80
.293
T.W. SCHULTZ et al.64
Dow
nloa
ded
by [
Uni
vers
ity o
f C
alif
orni
a Sa
nta
Cru
z] a
t 15:
56 2
4 N
ovem
ber
2014
TA
BL
EI
–co
nti
nued
No
.C
AS
Nam
elo
gðI
GC2
15
0Þ
log
Kow
Am
ax
Eq
ua
tio
ns *
38
10
8-4
4-1
3-M
eth
yla
nil
ine
20
.28
1.4
00
.294
39
58
2-2
2-9
b-M
eth
ylp
hen
eth
yla
min
e2
0.2
81
.68
0.2
85
(11
)4
06
99
-02
-54
-Met
hy
lph
enet
hy
lal
coho
l2
0.2
61
.68
0.2
85
41
10
0-4
6-9
Ben
zyla
min
e2
0.2
41
.09
0.2
84
42
52
9-1
9-1
2-T
olu
nit
rile
20
.24
2.2
10
.302
43
58
7-0
3-1
3-M
eth
ylb
enzy
lal
coho
l2
0.2
41
.60
0.2
85
44
62
-53-3
An
ilin
e2
0.2
30
.90
0.2
95
(10
)4
55
78
-54
-12
-Eth
yla
nil
ine
20
.22
1.7
40
.295
(7)
46
61
9-2
5-0
3-N
itro
ben
zyl
alco
ho
l2
0.2
21
.21
0.3
15
47
12
2-9
7-4
3-P
hen
yl-
1-p
rop
anol
20
.21
1.8
80
.285
48
100-5
2-7
Ben
zald
ehyde
20
.20
1.4
80
.317
49
12
7-6
6-2
2-P
hen
yl-
3-b
uty
n-2
-ol
20
.18
1.8
80
.300
50
61
8-3
6-0
1-P
hen
yle
thy
lam
ine
20
.18
1.4
00
.285
51
95
-51-2
2-C
hlo
roan
ilin
e2
0.1
71
.88
0.3
04
52
12
00
55
-09
-61
-Ph
eny
l-2
-bu
tano
l2
0.1
62
.02
0.2
84
53
95
-64-7
3,4
-Dim
eth
yla
nil
ine
20
.16
1.8
60
.293
54
95
-53-4
2-M
eth
yla
nil
ine
20
.16
1.4
30
.294
55
10
6-4
4-5
4-M
eth
ylp
hen
ol
20
.16
1.9
70
.300
56
64
5-5
9-0
3-P
hen
ylp
rop
ion
itri
le2
0.1
61
.72
0.2
94
57
62
1-4
2-1
3-A
ceta
mid
op
hen
ol
20
.16
0.7
30
.322
58
15
0-7
6-5
4-M
eth
oxy
ph
eno
l2
0.1
41
.34
0.2
98
59
10
3-7
3-1
Ph
enet
ole
20
.14
2.5
10
.300
60
621-5
9-0
3-H
ydro
xy-4
-met
hoxyben
zald
ehyde
20
.14
0.9
70
.317
(10
)6
11
08
-90
-7C
hlo
rob
enze
ne
20
.13
2.8
40
.311
62
71
-43-2
Ben
zene
20
.12
2.1
30
.280
(3)
63
89
10
4-4
6-1
2-P
hen
yl-
1-b
uta
no
l2
0.1
12
.11
0.2
88
64
622-3
2-2
Ben
zald
oxim
e2
0.1
11
.75
0.2
91
65
10
0-6
6-3
An
iso
le2
0.1
02
.11
0.3
00
66
37
2-1
9-0
3-F
luo
roan
ilin
e2
0.1
01
.30
0.3
07
67
4460-8
6-0
2,4
,5-T
rim
ethoxyben
zald
ehyde
20
.10
1.1
90
.317
68
22
13
5-4
9-5
(S^
)-1
-Ph
eny
l-1
-bu
tano
l2
0.0
92
.47
0.2
86
69
50
0-9
9-2
3,5
-Dim
eth
ox
yp
hen
ol
20
.09
1.6
40
.309
70
10
8-3
9-4
3-M
eth
ylp
hen
ol
20
.08
1.9
80
.300
71
10
4-5
4-1
3-P
hen
yl-
2-p
rop
en-1
-ol
20
.08
1.9
50
.285
72
10
3-0
5-9
a,a
-Dim
eth
ylb
enze
nep
rop
ano
l2
0.0
72
.42
0.2
85
73
93
-55-0
Pro
pio
ph
eno
ne
20
.07
2.1
90
.318
74
91
-23-6
2-N
itro
anis
ole
20
.07
1.7
30
.332
CHEMICAL SELECTION IN QSARs 65
Dow
nloa
ded
by [
Uni
vers
ity o
f C
alif
orni
a Sa
nta
Cru
z] a
t 15:
56 2
4 N
ovem
ber
2014
TA
BL
EI
–co
nti
nued
No
.C
AS
Nam
elo
gðI
GC2
15
0Þ
log
Kow
Am
ax
Eq
ua
tio
ns *
75
10
6-4
9-0
4-M
eth
yla
nil
ine
20
.05
1.3
90
.293
76
88
-05-1
2,4
,6-T
rim
ethy
lan
ilin
e2
0.0
52
.31
0.2
93
(11
)7
73
26
1-6
2-9
2-(
4-T
oly
l)-e
thy
lam
ine
20
.04
1.7
80
.284
78
58
7-0
2-0
3-E
thy
lan
ilin
e2
0.0
31
.94
0.2
94
79
12
1-3
3-5
3-M
eth
ox
y-4
-hy
dro
xy
ben
zald
ehy
de
20
.03
1.2
10
.318
80
44
21
-08
-34
-Hy
dro
xy
-3-m
eth
ox
yb
enzo
nit
rile
20
.03
1.4
20
.315
81
45
53
-07
-5E
thy
lp
hen
ylc
yan
oac
etat
e2
0.0
21
.63
0.3
32
82
22
14
4-6
0-1
(R^
)-1
-Ph
eny
l-1
-bu
tano
l2
0.0
12
.47
0.2
86
83
10
4-8
4-7
4-M
eth
yl
ben
zyla
min
e2
0.0
12
.81
0.2
84
84
63
7-5
3-6
Th
ioac
etan
ilid
e2
0.0
11
.71
0.3
41
85
27
22
-36
-33
-Ph
eny
l-1
-bu
tano
l0
.01
2.1
10
.286
86
18
23
-91
-2a
-Met
hy
lben
zyl
cyan
ide
0.0
11
.87
0.2
94
87
62
2-6
2-8
4-E
tho
xyp
hen
ol
0.0
11
.81
0.2
98
88
121-3
2-4
3-E
thoxy-4
-hydro
xyben
zald
ehyde
0.0
21.5
80.3
17
89
37
1-4
1-5
4-F
luo
rop
hen
ol
0.0
21
.77
0.3
07
(11
)9
05
89
-16
-24
-Eth
yla
nil
ine
0.0
31
.96
0.2
94
91
99
-09-2
3-N
itro
anil
ine
0.0
31
.43
0.3
19
92
10
6-4
7-8
4-C
hlo
roan
ilin
e0
.05
1.8
30
.302
93
15
65
-75
-9(^
)-2-P
hen
yl-
2-b
uta
no
l0
.06
2.3
40
.288
94
10
0-4
4-7
Ben
zyl
chlo
rid
e0
.06
2.3
00
.298
95
10
0-6
1-8
N-M
eth
yla
nil
ine
0.0
61
.66
0.2
95
96
76
8-5
9-2
4-E
thy
lben
zyl
alco
ho
l0
.07
2.1
30
.285
97
10
3-6
9-5
N-E
thy
lan
ilin
e0
.07
2.1
60
.295
98
10
8-8
6-1
Bro
mob
enze
ne
0.0
82
.99
0.3
08
99
88
-74-4
2-N
itro
anil
ine
0.0
81
.85
0.3
32
(9)
10
01
82
1-3
9-2
2-P
ropy
lan
ilin
e0
.08
2.4
20
.295
10
11
00
-83
-43
-Hy
dro
xy
ben
zald
ehy
de
0.0
81
.38
0.3
20
10
22
22
7-7
9-4
Th
iob
enza
mid
e0
.09
1.5
00
.339
10
33
50
-46
-91
-Flu
oro
-4-n
itro
ben
zen
e0
.10
1.8
90
.338
10
41
89
82
-54
-22
-Bro
mo
ben
zyl
alco
ho
l0
.10
1.9
70
.309
10
58
74
-90
-84
-Met
ho
xy
ben
zonit
rile
0.1
01
.70
0.3
15
10
61
08
-68
-93
,5-D
imet
hy
lphen
ol
0.1
12
.35
0.3
00
10
79
9-6
1-6
3-N
itro
ben
zald
ehy
de
0.1
11
.47
0.3
32
10
83
36
0-4
1-6
4-P
hen
yl-
1-b
uta
no
l0
.12
2.3
50
.285
10
97
0-7
0-2
40 -
Hyd
rox
yp
rop
iop
hen
on
e0
.12
2.0
30
.318
11
06
43
-28
-72
-iso
-Pro
pyla
nil
ine
0.1
22
.12
0.2
94
11
19
5-6
5-8
3,4
-Dim
eth
ylp
hen
ol
0.1
22
.23
0.2
99
T.W. SCHULTZ et al.66
Dow
nloa
ded
by [
Uni
vers
ity o
f C
alif
orni
a Sa
nta
Cru
z] a
t 15:
56 2
4 N
ovem
ber
2014
TA
BL
EI
–co
nti
nued
No
.C
AS
Nam
elo
gðI
GC2
15
0Þ
log
Kow
Am
ax
Eq
ua
tio
ns *
11
25
26
-75
-02
,3-D
imet
hy
lphen
ol
0.1
22
.42
0.3
00
11
39
5-8
8-5
4-C
hlo
rore
sorc
ino
l0
.13
1.8
00
.316
11
41
05
-67
-92
,4-D
imet
hy
lphen
ol
0.1
42
.35
0.3
00
11
51
56
-41
-22
-(4
-Chlo
rop
hen
yl)
-eth
yla
min
e0
.14
2.0
00
.311
11
69
8-9
5-3
Nit
rob
enze
ne
0.1
41
.85
0.3
18
11
79
5-8
7-4
2,5
-Dim
eth
ylp
hen
ol
0.1
42
.34
0.3
00
11
82
04
6-1
8-6
4-P
hen
ylb
uty
ron
itri
le0
.15
2.2
10
.293
11
98
73
-63
-23
-Ch
loro
ben
zyl
alco
ho
l0
.15
1.9
40
.314
12
01
35
-02
-42
-An
isal
deh
yd
e0
.15
1.7
20
.316
(6)
12
19
0-0
0-6
2-E
thy
lph
enol
0.1
62
.47
0.3
01
12
21
04
-86
-94
-Ch
loro
ben
zyla
min
e0
.16
1.8
10
.311
12
37
05
-73
-7(^
)-1-P
hen
yl-
2-p
enta
no
l0
.16
2.5
50
.284
12
44
36
0-4
7-8
Cin
nam
onit
rile
0.1
61
.95
0.3
03
12
55
52
-89
-62
-Nit
rob
enza
ldeh
yd
e0
.17
1.7
40
.333
12
61
00
-68
-5T
hio
anis
ole
0.1
82
.74
0.2
96
12
76
15
-65
-62
-Ch
loro
-4-m
eth
yla
nil
ine
0.1
82
.41
0.3
03
(4)
12
85
36
-60
-74
-iso
-Pro
pylb
enzy
lal
coho
l0
.18
2.5
30
.284
12
96
26
-19
-7P
hen
yl-
1,3
-dia
ldeh
yd
e0
.18
1.3
60
.324
13
03
67
-12
-42
-Flu
oro
ph
enol
0.1
91
.67
0.3
09
13
15
55
-16
-84
-Nit
rob
enza
ldeh
yd
e0
.20
1.5
60
.333
13
21
23
-07
-94
-Eth
ylp
hen
ol
0.2
12
.50
0.3
00
13
34
95
-40
-9B
uty
rop
hen
on
e0
.21
2.7
70
.318
13
49
9-8
8-7
4-i
so-P
rop
yla
nil
ine
0.2
22
.47
0.2
93
13
51
08
-42
-93
-Ch
loro
anil
ine
0.2
21
.88
0.3
12
13
61
00
-10
-74
-(D
imet
hy
lam
ino
)-b
enza
ldeh
yd
e0
.23
1.8
10
.310
13
75
99
1-3
1-1
3-A
nis
ald
ehy
de
0.2
31
.71
0.3
17
13
81
49
3-2
7-2
1-F
luo
ro-2
-nit
rob
enze
ne
0.2
31
.69
0.3
43
13
91
06
-42
-34
-Xy
len
e0
.25
3.1
50
.283
14
01
08
-88
-3T
olu
ene
0.2
52
.73
0.2
84
(10
)1
41
10
4-9
3-8
4-M
eth
yla
nis
ole
0.2
52
.81
0.2
99
14
28
73
-76
-74
-Ch
loro
ben
zyl
alco
ho
l0
.25
1.9
60
.312
14
38
9-8
4-9
2,4
-Dih
yd
rox
yac
eto
ph
eno
ne
0.2
51
.41
0.3
25
14
48
8-7
2-2
2-N
itro
tolu
ene
0.2
62
.30
0.3
17
145
771-6
0-8
Pen
tafl
uoro
anil
ine
0.2
61.8
70.3
38
14
61
00
8-8
9-5
2-P
hen
ylp
yri
din
e0
.27
2.6
30
.299
147
704-1
3-2
3-H
ydro
xy-4
-nit
roben
zald
ehyde
0.2
71.4
70.3
45
14
82
41
6-9
4-6
2,3
,6-T
rim
ethy
lph
eno
l0
.28
2.6
70
.300
CHEMICAL SELECTION IN QSARs 67
Dow
nloa
ded
by [
Uni
vers
ity o
f C
alif
orni
a Sa
nta
Cru
z] a
t 15:
56 2
4 N
ovem
ber
2014
TABLE I – continued
No.  CAS  Name  log(IGC50⁻¹)  log Kow  Amax  Equations*
149  620-17-7  3-Ethylphenol  0.29  2.50  0.300
150  579-66-8  2,6-Diethylaniline  0.31  2.87  0.295
151  18358-63-9  Methyl-4-methylaminobenzoate  0.31  2.16  0.316
152  613-90-1  Benzoyl cyanide  0.31  1.91  0.347
153  1875-88-3  4-Chlorophenethyl alcohol  0.32  1.90  0.312
154  121-89-1  3′-Nitroacetophenone  0.32  1.42  0.332
155  1745-81-9  2-Allylphenol  0.33  2.55  0.301
156  42454-06-8  5-Hydroxy-2-nitrobenzaldehyde  0.33  1.75  0.336
157  95-56-7  2-Bromophenol  0.33  2.33  0.314
158  364-74-9  2,5-Difluoronitrobenzene  0.33  1.86  0.349
159  95-69-2  4-Chloro-2-methylaniline  0.35  2.36  0.301
160  615-43-0  2-Iodoaniline  0.35  2.32  0.307
161  697-82-5  2,3,5-Trimethylphenol  0.36  2.92  0.299
162  591-50-4  Iodobenzene  0.36  3.25  0.301  (9)
163  769-92-6  4-(tert)-Butylaniline  0.36  2.70  0.293
164  89-62-3  4-Methyl-2-nitroaniline  0.37  1.82  0.331
165  1199-46-8  2-Amino-4-(tert)-butylphenol  0.37  2.44  0.295
166  101-82-6  2-Benzylpyridine  0.38  2.71  0.296
167  87-60-5  3-Chloro-2-methylaniline  0.38  2.36  0.312
168  95-74-9  3-Chloro-4-methylaniline  0.39  2.41  0.311
169  619-50-1  Methyl-4-nitrobenzoate  0.39  1.94  0.336
170  104-88-1  4-Chlorobenzaldehyde  0.40  2.13  0.325
171  10521-91-2  5-Phenyl-1-pentanol  0.42  2.77  0.285
172  103-63-9  (2-Bromoethyl)-benzene  0.42  3.09  0.294
173  527-60-6  2,4,6-Trimethylphenol  0.42  2.73  0.299
174  99-08-1  3-Nitrotoluene  0.42  2.45  0.317
175  90-02-8  2-Hydroxybenzaldehyde  0.42  1.81  0.318
176  100-00-5  1-Chloro-4-nitrobenzene  0.43  2.39  0.340
177  5292-45-5  Dimethylnitroterephthalate  0.43  1.66  0.340
178  5922-60-1  2-Amino-5-chlorobenzonitrile  0.44  1.79  0.323
179  619-24-9  3-Nitrobenzonitrile  0.45  1.17  0.330  (4)
180  106-38-7  4-Bromotoluene  0.47  3.50  0.306
181  1008-88-4  3-Phenylpyridine  0.47  2.53  0.296
182  99-89-8  4-iso-Propylphenol  0.47  2.90  0.300
183  877-65-6  4-(tert)-Butylbenzyl alcohol  0.48  2.93  0.285
184  91-01-0  Benzhydrol  0.50  2.67  0.289
185  95-79-4  5-Chloro-2-methylaniline  0.50  2.36  0.311
TABLE I – continued
No.  CAS  Name  log(IGC50⁻¹)  log Kow  Amax  Equations*
186  554-84-7  3-Nitrophenol  0.51  2.00  0.324
187  95-50-1  1,2-Dichlorobenzene  0.53  3.38  0.319
188  6361-21-3  2-Chloro-5-nitrobenzaldehyde  0.53  2.25  0.357
189  106-48-9  4-Chlorophenol  0.54  2.39  0.308
190  5651-88-7  Phenyl propargyl sulfide  0.54  3.30  0.298
191  615-74-7  2-Chloro-5-methylphenol  0.54  2.65  0.309
192  552-41-0  2-Hydroxy-4-methoxyacetophenone  0.55  1.98  0.324
193  554-00-7  2,4-Dichloroaniline  0.56  2.78  0.311
194  83-41-0  1,2-Dimethyl-3-nitrobenzene  0.56  2.83  0.316
195  1009-14-9  Valerophenone  0.56  3.17  0.318
196  119-33-5  4-Methyl-2-nitrophenol  0.57  2.15  0.333
197  95-82-9  2,5-Dichloroaniline  0.58  2.75  0.318
198  103-26-4  trans-Methyl cinnamate  0.58  2.62  0.319
199  99-51-4  1,2-Dimethyl-4-nitrobenzene  0.59  2.91  0.314
200  7120-43-6  5-Chloro-2-hydroxybenzamide  0.59  2.13  0.323
201  700-38-9  5-Methyl-2-nitrophenol  0.59  2.31  0.333
202  623-12-1  4-Chloroanisole  0.60  2.79  0.307
203  6627-55-0  2-Bromo-4-methylphenol  0.60  2.85  0.312
204  16532-79-9  4-Bromophenyl acetonitrile  0.60  2.43  0.315  (8)
205  4344-55-2  4-Butoxyaniline  0.61  2.59  0.293
206  30273-11-1  4-sec-Butylaniline  0.61  2.87  0.294  (8)
207  618-45-1  3-iso-Propylphenol  0.61  2.90  0.300
208  88-69-7  2-iso-Propylphenol  0.61  2.88  0.301
209  4920-77-8  3-Methyl-2-nitrophenol  0.61  2.29  0.332
210  3011-34-5  4-Hydroxy-3-nitrobenzaldehyde  0.61  1.48  0.352  (10)
211  2973-76-4  5-Bromovanillin  0.62  1.92  0.326
212  402-45-9  α,α,α-Trifluoro-4-cresol  0.62  2.82  0.340
213  2116-65-6  4-Benzylpyridine  0.63  2.62  0.298
214  645-56-7  4-Propylphenol  0.64  3.20  0.300
215  2700-22-3  Benzylidene malononitrile  0.64  2.15  0.328
216  99-99-0  4-Nitrotoluene  0.65  2.37  0.315
217  626-01-7  3-Iodoaniline  0.65  2.90  0.302
218  2495-37-6  Benzyl methacrylate  0.65  2.53  0.320
219  140-53-4  4-Chlorobenzyl cyanide  0.66  2.47  0.319
220  5428-54-6  2-Methyl-5-nitrophenol  0.66  2.35  0.321
221  601-89-8  2-Nitroresorcinol  0.66  1.56  0.341
222  1585-07-5  1-Bromo-4-ethylbenzene  0.67  4.03  0.306
TABLE I – continued
No.  CAS  Name  log(IGC50⁻¹)  log Kow  Amax  Equations*
223  122-03-2  4-iso-Propylbenzaldehyde  0.67  2.92  0.316
224  88-75-5  2-Nitrophenol  0.67  1.77  0.335
225  106-37-6  1,4-Dibromobenzene  0.68  3.79  0.317
226  83-42-1  2-Chloro-6-nitrotoluene  0.68  3.09  0.329
227  88-73-3  1-Chloro-2-nitrobenzene  0.68  2.52  0.343
228  106-41-2  4-Bromophenol  0.68  2.59  0.311
229  1137-41-3  4-Benzoylaniline  0.68  2.46  0.317
230  98-82-8  iso-Propylbenzene  0.69  3.66  0.285
231  1124-04-5  2-Chloro-4,5-dimethylphenol  0.69  3.10  0.309
232  122-94-1  4-Butoxyphenol  0.70  2.90  0.298
233  1570-64-5  4-Chloro-2-methylphenol  0.70  2.78  0.307
234  626-43-7  3,5-Dichloroaniline  0.71  2.90  0.319
235  36436-65-4  2-Hydroxy-4,5-dimethylacetophenone  0.71  2.86  0.316
236  99-77-4  Ethyl-4-nitrobenzoate  0.71  2.33  0.335
237  555-03-3  3-Nitroanisole  0.72  2.17  0.321
238  97-02-9  2,4-Dinitroaniline  0.72  1.72  0.361  (3)
239  121-73-3  1-Chloro-3-nitrobenzene  0.73  2.47  0.332
240  87-65-0  2,6-Dichlorophenol  0.73  2.64  0.320
241  585-34-2  3-tert-Butylphenol  0.74  3.30  0.300
242  29338-49-6  1,1-Diphenyl-2-propanol  0.75  2.93  0.290
243  121-87-9  2-Chloro-4-nitroaniline  0.75  2.05  0.336
244  577-19-5  1-Bromo-2-nitrobenzene  0.75  2.51  0.338  (8)
245  97-54-1  2-Methoxy-4-propenylphenol  0.75  3.31  0.302
246  2973-19-5  2-Chloromethyl-4-nitrophenol  0.75  2.42  0.338
247  78056-39-0  4,5-Difluoro-2-nitroaniline  0.75  2.19  0.348
248  24544-04-5  2,6-Diisopropylaniline  0.76  3.18  0.294
249  65262-96-6  3-Chloro-5-methoxyphenol  0.76  2.50  0.322
250  616-86-4  4-Ethoxy-2-nitroaniline  0.76  2.39  0.326  (3)
251  99-65-0  1,3-Dinitrobenzene  0.76  1.49  0.345
252  2357-47-3  α,α,α-4-Tetrafluoro-3-toluidine  0.77  2.51  0.341
253  94-30-4  Ethyl-4-methoxybenzoate  0.77  2.81  0.319
254  5342-87-0  (±)-1,2-Diphenyl-2-propanol  0.80  3.23  0.290
255  59-50-7  4-Chloro-3-methylphenol  0.80  3.10  0.307
256  350-30-1  3-Chloro-4-fluoronitrobenzene  0.80  2.74  0.347  (11)
257  2905-69-3  Methyl-2,5-dichlorobenzoate  0.81  3.16  0.332
258  89-59-8  4-Chloro-2-nitrotoluene  0.82  3.05  0.328
259  653-37-2  Pentafluorobenzaldehyde  0.82  2.39  0.357  (8)
TABLE I – continued
No.  CAS  Name  log(IGC50⁻¹)  log Kow  Amax  Equations*
260  14548-45-9  4-Bromophenyl-3-pyridyl ketone  0.82  2.96  0.331
261  42087-80-9  Methyl-4-chloro-2-nitrobenzoate  0.82  2.41  0.342
262  100-29-8  4-Nitrophenetole  0.83  2.53  0.328
263  573-56-8  2,6-Dinitrophenol  0.83  1.33  0.372
264  606-22-4  2,6-Dinitroaniline  0.84  1.79  0.366
265  540-38-5  4-Iodophenol  0.85  2.90  0.311
266  603-71-4  1,3,5-Trimethyl-2-nitrobenzene  0.86  3.22  0.313  (5)
267  2430-16-2  6-Phenyl-1-hexanol  0.87  3.30  0.285  (5)
268  108-43-0  3-Chlorophenol  0.87  2.50  0.317
269  119-61-9  Benzophenone  0.87  3.18  0.321
270  108-70-3  1,3,5-Trichlorobenzene  0.87  4.19  0.325
271  121-14-2  2,4-Dinitrotoluene  0.87  1.98  0.345  (5)
272  98-54-4  4-(tert)-Butylphenol  0.91  3.31  0.300
273  3597-91-9  4-Biphenylmethanol  0.92  2.99  0.287
274  527-54-8  3,4,5-Trimethylphenol  0.93  2.87  0.298
275  131-55-5  2,2′,4,4′-Tetrahydroxybenzophenone  0.96  2.92  0.333
276  39905-50-5  4-Pentyloxyaniline  0.97  3.12  0.293
277  611-06-3  2,4-Dichloronitrobenzene  0.99  3.09  0.350
278  103-36-6  (trans)-Ethyl cinnamate  0.99  2.99  0.318
279  1137-42-4  4-Benzoylphenol  1.02  3.07  0.321
280  1137-42-4  4-Benzoylphenol  1.02  3.07  0.321
281  585-79-5  1-Bromo-3-nitrobenzene  1.03  2.64  0.328
282  120-83-2  2,4-Dichlorophenol  1.04  3.17  0.318
283  329-71-5  2,5-Dinitrophenol  1.04  1.86  0.361
284  874-42-0  2,4-Dichlorobenzaldehyde  1.04  3.08  0.335
285  92-52-4  Biphenyl  1.05  3.98  0.288  (8)
286  51-28-5  2,4-Dinitrophenol  1.06  1.54  0.368
287  104-13-2  4-Butylaniline  1.07  3.18  0.294
288  95-75-0  3,4-Dichlorotoluene  1.07  3.95  0.318
289  3209-22-1  2,3-Dichloronitrobenzene  1.07  3.05  0.350
290  2491-32-9  Benzyl-4-hydroxyphenyl ketone  1.07  3.22  0.321
291  120-82-1  1,2,4-Trichlorobenzene  1.08  4.02  0.326
292  14143-32-9  4-Chloro-3-ethylphenol  1.08  3.51  0.308
293  3819-88-3  1-Fluoro-3-iodo-5-nitrobenzene  1.09  3.15  0.335
294  136-36-7  Resorcinol monobenzoate  1.11  3.13  0.330
295  3531-19-9  6-Chloro-2,4-dinitroaniline  1.12  2.46  0.370  (6)
296  3218-36-8  4-Biphenylcarboxaldehyde  1.12  3.38  0.317
TABLE I – continued
No.  CAS  Name  log(IGC50⁻¹)  log Kow  Amax  Equations*
297  618-62-2  3,5-Dichloronitrobenzene  1.13  3.09  0.339
298  89-61-2  2,5-Dichloronitrobenzene  1.13  3.03  0.349
299  7149-70-4  2-Bromo-5-nitrotoluene  1.16  3.25  0.334
300  99-54-7  3,4-Dichloronitrobenzene  1.16  3.12  0.348
301  1879-09-0  6-tert-Butyl-2,4-dimethylphenol  1.16  4.30  0.300
302  2374-05-2  4-Bromo-2,6-dimethylphenol  1.16  3.63  0.309
303  835-11-0  2,2′-Dihydroxybenzophenone  1.16  3.47  0.336
304  1689-84-5  3,5-Dibromo-4-hydroxybenzonitrile  1.16  2.88  0.341
305  5736-91-4  4-(Pentyloxy)-benzaldehyde  1.18  3.89  0.315  (9)
306  100-14-1  4-Nitrobenzyl chloride  1.18  2.45  0.323
307  942-92-7  Hexanophenone  1.19  3.70  0.318
308  88-04-0  4-Chloro-3,5-dimethylphenol  1.20  3.48  0.306
309  80-46-6  4-tert-Pentylphenol  1.23  3.83  0.300  (3)
310  7778-83-8  n-Propyl cinnamate  1.23  3.52  0.318
311  1817-73-8  2-Bromo-4,6-dinitroaniline  1.24  2.61  0.372
312  104-51-8  n-Butylbenzene  1.25  4.26  0.284
313  528-29-0  1,2-Dinitrobenzene  1.25  1.69  0.351
314  90-90-4  4-Bromobenzophenone  1.26  4.12  0.326
315  2683-43-4  2,4-Dichloro-6-nitroaniline  1.26  3.33  0.349
316  67-36-7  4-Phenoxybenzaldehyde  1.26  3.96  0.317
317  610-78-6  4-Chloro-3-nitrophenol  1.27  2.46  0.339
318  7530-27-0  4-Bromo-6-chloro-2-cresol  1.28  3.61  0.319
319  636-30-6  2,4,5-Trichloroaniline  1.30  3.69  0.325
320  100-25-4  1,4-Dinitrobenzene  1.30  1.47  0.347
321  86-00-0  2-Nitrobiphenyl  1.30  3.77  0.322
322  500-66-3  5-Pentylresorcinol  1.31  3.42  0.305
323  5798-75-4  Ethyl-4-bromobenzoate  1.33  3.50  0.325
324  13608-87-2  2′,3′,4′-Trichloroacetophenone  1.34  3.21  0.336  (6)
325  93-99-2  Phenyl benzoate  1.35  3.59  0.327
326  17696-62-7  Phenyl-4-hydroxybenzoate  1.37  3.49  0.327
327  3460-18-2  2,5-Dibromonitrobenzene  1.37  3.41  0.346
328  39905-57-2  4-Hexyloxyaniline  1.38  3.65  0.293
329  615-58-7  2,4-Dibromophenol  1.40  3.25  0.323  (10)
330  88-06-2  2,4,6-Trichlorophenol  1.41  3.69  0.326
331  103-72-0  Phenyl isothiocyanate  1.41  3.28  0.352  (4)
332  131-57-7  2-Hydroxy-4-methoxybenzophenone  1.42  3.58  0.327
333  18708-70-8  1,3,5-Trichloro-2-nitrobenzene  1.43  3.69  0.354
TABLE I – continued
No.  CAS  Name  log(IGC50⁻¹)  log Kow  Amax  Equations*
334  120-51-4  Benzyl benzoate  1.45  3.97  0.321
335  6521-30-8  iso-Amyl-4-hydroxybenzoate  1.48  3.97  0.320
336  844-51-9  2,5-Diphenyl-1,4-benzoquinone  1.48  3.16  0.332
337  134-85-0  4-Chlorobenzophenone  1.50  3.97  0.325  (4)
338  17700-09-3  1,2,3-Trichloro-4-nitrobenzene  1.51  3.61  0.357
339  89-69-0  1,2,4-Trichloro-5-nitrobenzene  1.53  3.47  0.354
340  538-65-8  n-Butyl cinnamate  1.53  4.05  0.318
341  1016-78-0  3-Chlorobenzophenone  1.55  3.97  0.325
342  90-60-8  3,5-Dichlorosalicylaldehyde  1.55  3.07  0.334
343  1671-75-6  Heptanophenone  1.56  4.23  0.318
344  591-35-5  3,5-Dichlorophenol  1.56  3.61  0.325
345  620-88-2  4-Nitrophenyl phenyl ether  1.58  3.83  0.330
346  827-23-6  2,4-Dibromo-6-nitroaniline  1.62  3.63  0.352
347  7147-89-9  4-Chloro-6-nitro-3-cresol  1.63  2.93  0.343
348  771-61-9  Pentafluorophenol  1.63  3.23  0.345
349  1138-52-9  3,5-Di-tert-butylphenol  1.64  5.13  0.298  (9)
350  90-59-5  3,5-Dibromosalicylaldehyde  1.65  3.42  0.338
351  88-30-2  3-Trifluoromethyl-4-nitrophenol  1.65  2.77  0.352
352  6641-64-1  4,5-Dichloro-2-nitroaniline  1.66  3.21  0.345
353  70-34-8  2,4-Dinitro-1-fluorobenzene  1.71  1.47  0.375  (7)
354  69212-31-3  2-(Benzylthio)-3-nitropyridine  1.72  3.42  0.335
355  534-52-1  4,6-Dinitro-2-methylphenol  1.73  2.12  0.366
356  609-89-2  2,4-Dichloro-6-nitrophenol  1.75  3.07  0.354
357  3481-20-7  2,3,5,6-Tetrachloroaniline  1.76  4.10  0.330
358  3217-15-0  4-Bromo-2,6-dichlorophenol  1.78  3.52  0.329
359  879-39-0  2,3,4,5-Tetrachloronitrobenzene  1.78  3.93  0.361
360  538-68-1  n-Amylbenzene  1.79  4.90  0.284  (4)
361  136-77-6  4-Hexylresorcinol  1.80  3.45  0.306
362  4097-49-8  4-(tert)-Butyl-2,6-dinitrophenol  1.80  3.61  0.367  (5)
363  305-85-1  2,6-Diiodo-4-nitrophenol  1.81  3.52  0.353
364  117-18-0  2,3,5,6-Tetrachloronitrobenzene  1.82  4.38  0.360  (6)
365  314-41-0  2,3,4,6-Tetrafluoronitrobenzene  1.87  1.86  0.372
366  1674-37-9  Octanophenone  1.89  4.75  0.318  (6)
367  771-69-7  1,2,3-Trifluoro-4-nitrobenzene  1.89  2.01  0.362
368  118-79-6  2,4,6-Tribromophenol  1.91  4.08  0.334
369  634-83-3  2,3,4,5-Tetrachloroaniline  1.96  4.27  0.333
TABLE I – continued
No.  CAS  Name  log(IGC50⁻¹)  log Kow  Amax  Equations*
370  5707-44-8  4-Ethylbiphenyl  1.97  5.06  0.288
371  95-94-3  1,2,4,5-Tetrachlorobenzene  2.00  4.63  0.331  (9)
372  87-86-5  Pentachlorophenol  2.07  5.18  0.343
373  95-95-4  2,4,5-Trichlorophenol  2.10  3.72  0.330
374  709-49-9  2,4-Dinitro-1-iodobenzene  2.12  2.50  0.359
375  97-00-7  1-Chloro-2,4-dinitrobenzene  2.16  2.14  0.374
376  58-90-2  2,3,4,6-Tetrachlorophenol  2.18  3.88  0.337
377  6284-83-9  1,3,5-Trichloro-2,4-dinitrobenzene hemihydrate  2.19  2.97  0.385
378  6306-39-4  1,2-Dichloro-4,5-dinitrobenzene  2.21  2.93  0.365  (11)
379  28689-08-9  1,5-Dichloro-2,3-dinitrobenzene  2.42  2.85  0.369
380  104-40-5  Nonylphenol  2.47  5.76  0.300  (3)
381  576-55-6  3,4,5,6-Tetrabromo-2-cresol  2.57  4.97  0.336
382  2678-21-9  1,3-Dinitro-2,4,5-trichlorobenzene  2.60  3.05  0.385  (3)
383  608-71-9  Pentabromophenol  2.66  4.85  0.346  (3)
384  4901-51-3  2,3,4,5-Tetrachlorophenol  2.72  4.21  0.339  (7)
385  20098-38-8  1,4-Dinitrotetrachlorobenzene  2.82  3.44  0.380
* The notation in this column indicates which compounds entered the relevant training set.
reported in Table IV. In each case the slope is close to one and the intercept near zero
indicating a near ideal relationship.
The luxury of having a toxicity database of 385 compounds affords the opportunity to
perform comparisons not normally possible. The relationships between observed and
predicted toxicity for those compounds not within the pre-defined test sets (in other words all
the remaining compounds in the data set which were not part of the test and training sets) are
reported in Table V. These results revealed that in all cases the QSAR was a good predictor of
toxic potency of the remaining derivatives.
FIGURE 1 Plot of observed toxicity vs. toxicity predicted by Eq. (2).
FIGURE 2 Plot of log Kow vs. Amax. The values of both descriptors are standardized in the range (0, 1). Ntrain = 15 and Ntest = 45.
DISCUSSION
The determination of the quality of a QSAR is frequently a daunting task. This is because
structure–toxicity relationships are estimations of intricate processes, which, for the
most part, are not known in detail [24]. Clearly, defining the quality of a QSAR
goes further than the determination of a highly significant statistical fit. While a high quality
QSAR can only be constructed and validated with high quality biological and physico-
chemical data [8], a high quality QSAR must be applicable to further compounds of interest.
For example, a high quality QSAR can be developed and validated for a congeneric series of
aliphatic alcohols (see Refs. [25–28]). Such hydrophobic-dependent QSARs are transparent
and mechanistically interpretable. However, because of the narrow molecular domain on which
they are founded, such QSARs are of limited applicability [29]. Thus, the correct formal
selection of training and validating data is at the heart of the issues of QSAR quality and
applicability.
It is our collective opinion that training sets in QSAR development should be selected so as
to make best use of diversity and thus optimize applicability. The goal in this study was to
investigate the role of the size of the relative data sets required to achieve maximal coverage
of the descriptor space, with as few chemicals as possible. To enable this aim and to allow
applicability of the QSAR, diversity-based selection of the training set ensured that
validation will be interpolation within the physico-chemical domain and not extrapolation
from outside the domain (see Fig. 2 as an example).
TABLE II Diversity/representativity criteria for the training set, selected by DOE, and the test set, selected by KNN
Training set                    Test set
Ntrain  Dave   Dmax   Gopt      Ntest  Dave   Dmax   Gopt   Dc-c
10      0.398  0.600  0.663     30     0.275  0.582  0.473  0.133
15      0.365  0.607  0.601     45     0.304  0.602  0.505  0.154
20      0.354  0.613  0.577     60     0.283  0.580  0.488  0.167
25      0.345  0.609  0.567     75     0.285  0.592  0.481  0.140
30      0.351  0.584  0.601     90     0.290  0.590  0.492  0.104
35      0.336  0.584  0.575     105    0.286  0.589  0.486  0.120
40      0.328  0.598  0.548     120    0.273  0.588  0.464  0.152
45      0.324  0.603  0.537     135    0.273  0.606  0.450  0.152
50      0.322  0.606  0.531     150    0.267  0.614  0.435  0.152
The numbers were derived from standardized (0, 1) variables (Dave: average distance to the centroid of the set; Dmax: maximum distance to the centroid of the set; Gopt = Dave/Dmax; Dc-c: distance between the centroids of the training and test set).
TABLE III Coefficients (standard error in parentheses) and statistics of the model TOX = a × log P + b × Amax − c
Ntrain  a              b             c               s      r 2(adj.)  r 2(cv)  F    Equation
10      0.604 (0.040)  17.15 (2.45)  −6.196 (0.790)  0.244  0.970      0.938    149  (3)
15      0.591 (0.038)  16.23 (2.28)  −5.900 (0.746)  0.258  0.954      0.927    147  (4)
20      0.590 (0.034)  15.30 (1.84)  −5.628 (0.596)  0.248  0.952      0.932    189  (5)
25      0.574 (0.033)  13.62 (1.75)  −5.091 (0.571)  0.259  0.940      0.922    189  (6)
30      0.569 (0.036)  15.72 (1.95)  −5.704 (0.631)  0.313  0.922      0.906    172  (7)
35      0.571 (0.034)  15.07 (1.75)  −5.519 (0.571)  0.302  0.915      0.900    184  (8)
40      0.561 (0.032)  15.53 (1.73)  −5.682 (0.569)  0.306  0.907      0.893    191  (9)
45      0.554 (0.030)  15.24 (1.59)  −5.557 (0.520)  0.295  0.909      0.897    221  (10)
50      0.557 (0.030)  15.82 (1.52)  −5.751 (0.492)  0.300  0.905      0.894    236  (11)
Ntrain is obtained in the DOE procedure.
Dissimilarity-based procedures such as the DOE methodology applied in this study have
been utilized successfully for the selection of informative training sets in the QSAR analysis
of both pharmacological [30,31] and toxicological [32,33] endpoints. In conjunction with
(fractional) factorial and D-optimal design, they have been used to maximize the volume of
the descriptor space covered by the training set [14]. In compound selection, these
procedures involve the identification of a subset comprising the n most dissimilar molecules
in a database containing N molecules, typically where n ≪ N [21]. A further technique for
compound selection was described in the recent paper of Golbraikh and Tropsha [34] who
examined the use of sphere-exclusion algorithms for the rational selection of training and test
sets used in the development of QSAR models. One drawback of sphere exclusion
algorithms, that it is not possible to specify the size of the subset, is overcome in the
maximum-dissimilarity algorithms, one of which (distance-based optimality) was used in
this study.
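As an illustration of the maximum-dissimilarity idea discussed in the paragraph above, the following is a minimal sketch of a greedy max-min selection over standardized descriptors. It is not the distance-based optimality routine used in the study: the function name `maxmin_select`, the choice of Euclidean distance, and the seeding rule (start from the point farthest from the centroid) are all assumptions made for the example. Unlike sphere exclusion, the subset size n is specified directly.

```python
import math

def maxmin_select(points, n_select):
    """Greedy max-min selection (illustrative, hypothetical helper).

    points   : list of descriptor tuples, e.g. (log Kow, Amax) scaled to (0, 1)
    n_select : number of compounds wanted in the training set
    Returns the indices of the selected points.
    """
    if not 0 < n_select <= len(points):
        raise ValueError("n_select must be between 1 and len(points)")
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    # Assumed seeding rule: start from the point farthest from the centroid.
    first = max(range(len(points)), key=lambda i: math.dist(points[i], centroid))
    chosen = [first]
    # min_dist[i] = distance from point i to its nearest already-chosen point.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < n_select:
        remaining = [i for i in range(len(points)) if i not in chosen]
        # Add the candidate that is most dissimilar to everything chosen so far.
        nxt = max(remaining, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen
```

On a toy set made of four corner points and two near-central points, asking for four compounds returns the corners, i.e. the subset that spans the descriptor space.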
While selection of the training set of chemicals for a QSAR should be based on diversity,
we feel that selection of the chemicals for validation should be based on representivity.
Representivity methods, such as KNN, ensure that the validation chemicals mimic the
distribution of all the chemicals within the descriptor space and thus reflect, and allow, only
interpolation within the training space (see Fig. 2 as an example).
There is no common opinion regarding the optimal distribution of chemicals between the
training and validation subsets. Brown and Martin [15] suggested that n should be
approximately 0.2 N if a subset is to be fully representative of its parent database, while
Matter [16] suggested a value of approximately 0.35 N. For the purposes of this investigation
n was selected to be 0.25 N. The selection of the training set to be 25% of the combined
training and test sets, is within the 20–35% range reported in the literature and allows for
easy determination of the number of compounds in the validation subset.
KNN was utilized to obtain a representative test set. One drawback of a clustering
technique such as KNN is that it is not possible to control cluster size. Thus, in selecting
compounds for toxicity testing for validation of a QSAR, “empty” or “exhausted” (i.e. ones
with insufficient compounds) clusters may become apparent. To illustrate this point with an
example, with a 25-cluster scheme based on the data presented in this study, 8 clusters with
more than 20 derivatives were obtained, but also 6 clusters with less than 4 derivatives. The
latter 6 clusters clearly pose a potential problem in the selection of validation chemicals. As
the number of clusters increased so did the number of clusters with relatively few
derivatives. As a practical approach to circumvent the problem of “empty” clusters or those
“exhausted” of compounds, in some cases validation derivatives were selected from the cluster
nearest to the “empty” or “exhausted” one. To achieve this, the distances between the
cluster centroids were used to reveal the nearest cluster. If the nearest cluster was also
“empty” or exhausted of derivatives the procedure was repeated to find the nearest cluster
containing derivatives.
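The fallback described above (use the nearest cluster, judged by centroid distance, that still contains derivatives) can be sketched as follows. This is a hedged illustration only: the function names, the dictionary layout, and the rule of taking the first listed compound from the donor cluster are assumptions, since the text does not say which derivative within the donor cluster was chosen.

```python
import math

def nearest_available_cluster(target, centroids, members):
    """Return the id of the non-empty cluster whose centroid is closest to `target`.

    centroids : dict mapping cluster id -> centroid tuple (hypothetical layout)
    members   : dict mapping cluster id -> list of compounds still available
    Empty or exhausted clusters are skipped, which is equivalent to repeating
    the nearest-cluster search until a cluster containing derivatives is found.
    """
    candidates = [c for c in centroids if c != target and members.get(c)]
    if not candidates:
        return None
    return min(candidates,
               key=lambda c: math.dist(centroids[target], centroids[c]))

def pick_validation_compound(target, centroids, members):
    """Take one compound for cluster `target`, borrowing from a neighbour if needed."""
    source = target if members.get(target) else nearest_available_cluster(
        target, centroids, members)
    if source is None:
        raise ValueError("no compounds left in any cluster")
    # Assumption: take the first remaining compound of the donor cluster.
    return source, members[source].pop(0)
```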
TABLE IV Statistics of the relationship between observed and calculated toxicity, with intercept and through the origin, for the test set selected by KNN
        TOX(obs) = a × TOX(calc) + b    TOX(obs) = a × TOX(calc)
Ntest   r 2     Slope   Intercept       r 2*    Slope*
30      0.758   0.977   −0.128          0.838   0.902
45      0.819   0.959   −0.057          0.890   0.928
60      0.797   0.973   −0.051          0.876   0.942
75      0.811   1.059   −0.063          0.893   1.019
90      0.843   1.042   −0.078          0.906   0.995
105     0.821   1.042   −0.079          0.898   0.991
120     0.815   1.016   −0.031          0.904   0.996
135     0.821   1.031   −0.058          0.899   0.992
150     0.834   0.999   −0.047          0.904   0.969
The basis of this study was the understanding that the training set should be
maximally diverse and the test set should be representative of the training set in terms of
the physico-chemical descriptors (i.e. without reference to toxicity). For quantitative
estimation of how well this goal was met, two criteria, Gopt and Dc-c, as defined in Table II,
were used. Ideally, the training and test set should have a common centroid (Dc-c = 0) and
equivalent Gopt values. However, according to one of the main rules for development of
predictive QSARs, namely that the predictions should be interpolated and not extrapolated
out of the chemical domain of the model [8], it is acceptable for the Gopt (test) to be slightly
lower than the Gopt (train). Since practical application will seldom give the ideal, theoretical
result, it may be assumed that the “best” test set is that which gives the lowest Dc-c distance
and has a Gopt value closest to that of the training set. Quantitative criteria such as Gopt and
Dc-c to assess training and test sets were employed as the KNN method allows the choice of
different chemicals for the validation set. This is especially the case and useful when large
numbers of clusters, many with few compounds, were analyzed (i.e. the case of the 8 clusters
containing more than 20 compounds in the 25-cluster scheme referred to above).
The diversity and representativity criteria for training and test sets are shown in Table II.
For the smallest number of compounds used, Ntrain = 10 and Ntest = 30, the distance
between the centroids of the training and test set is relatively low. However, these two sets are
characterized by the greatest difference (i.e. about 0.2) between Gopt (train) and Gopt (test).
All other selections of chemicals have a lower difference (Gopt (train) − Gopt (test))
(i.e. about 0.1) and slightly greater Dc-c values. Further examination of these data led to the
conclusion that for training set sizes of 15 or more (i.e. Ntrain = 15, Ntest = 45) (see Fig. 2)
the training and test sets selected are approximately equivalent in terms of representativity.
In all cases Gopt (test) is lower than Gopt (train), which indicates that the predictions for the
derivatives in the validation subset will not be the result of extrapolation to descriptor space
outside of that covered by the training set. Thus, the information in Table II and Fig. 2
indicates that, despite not being ideal, the methodology used resulted in acceptable selection
of diverse training, and representative, testing sets. Moreover, it shows that the Gopt and Dc-c
criteria can be used successfully for assessment of diversity/representivity in QSAR
development.
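The two criteria follow directly from their definitions in Table II (Dave and Dmax are the average and maximum distance to the centroid of a set, Gopt = Dave/Dmax, and Dc-c is the distance between the training and test centroids). The sketch below computes them for descriptors already standardized to (0, 1); the function names and the use of Euclidean distance are assumptions for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of descriptor tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def spread_criteria(points):
    """Return (Dave, Dmax, Gopt) for one set of standardized descriptors."""
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    d_ave = sum(dists) / len(dists)
    d_max = max(dists)
    return d_ave, d_max, d_ave / d_max

def centroid_distance(train, test):
    """Dc-c: distance between the centroids of the training and test sets."""
    return math.dist(centroid(train), centroid(test))
```

As a check against the tabulated values, the first row of Table II gives Dave/Dmax = 0.398/0.600 ≈ 0.663 for the training set, matching the reported Gopt.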
The modeling of the toxicity by the DOE-selected training sets of varying sizes resulted in
QSARs similar in terms of coefficients and statistical criteria (Table III). Only two models,
Ntrain = 10 and Ntrain = 25, differ slightly from others in terms of the regression coefficient
on Amax. As the size of the training set increases, the coefficients for log Kow and Amax
decrease slightly reflecting more closely the QSAR for the complete data set described in
Eq. (2). For all the models there is excellent statistical fit, although it should be noted that the
r 2 remains unrealistically high for most of the models in terms of the likely biological error
in the data. There is a systematic decrease in r 2 of the statistical fit as training set size
increases. However, for assessment of the robustness of the models, the predictivity of the
QSAR, i.e. its ability to predict the toxicity of the compounds in the test set accurately, is of
greater importance.
For assessment of predictivity of the toxicity of the chemicals in the test set, for the models
listed in Table III, two regression procedures were applied [35]. The first (regression with
intercept) provides the actual relationship in terms of a regression equation between observed
and predicted toxicity. However, the simultaneous variation in the slope and the intercept
makes the comparison between models difficult so a second regression procedure, regression
through the origin, (i.e. no intercept) was applied. Despite the fact that the coefficient of
determination in the fit through the origin (i.e. r 2*) can not be compared directly to that for
the fit with intercept (i.e. r 2), the former is more convenient for comparison between the
models, since it accounts for the spread of the toxic potency around the ideal line (i.e. the line
with a slope of 1 and an intercept of 0). In this case the term Slope* accounts for the deviation
of the real regression line from the ideal one.
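The two regression procedures can be sketched in a few lines. Here x stands for TOX(calc) and y for TOX(obs). The fit with intercept uses ordinary least squares; the fit through the origin uses Slope* = Σxy/Σx². For r 2* the sketch assumes the common uncentred convention, 1 − Σ(y − Slope*·x)²/Σy², which is one reason it cannot be compared directly with r 2; the source does not state which convention was used, so treat this definition, like the function names, as an assumption.

```python
def fit_with_intercept(x, y):
    """Ordinary least squares y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

def fit_through_origin(x, y):
    """Least squares y = slope * x with no intercept; returns (slope_star, r2_star).

    r2_star uses the uncentred total sum of squares (assumed convention).
    """
    slope_star = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    ss_res = sum((yi - slope_star * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot_uncentred = sum(yi * yi for yi in y)
    return slope_star, 1.0 - ss_res / ss_tot_uncentred
```

A Slope* below 1 means the observed values are, on average, lower than the calculated ones, which is the over-prediction case.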
For the prediction of the toxicity of the test sets selected by the KNN method, the
statistical fit in terms of r 2 values is quite acceptable for that required to predict toxic
potency (Table IV). Increasing the size of the training set beyond Ntrain = 10 appears to
provide no significant improvement in the prediction of toxicity.
Further testing of model predictivity was performed on the compounds that were
not selected for either the training or the test set (Table V). It should be stressed that this
check of predictivity will not be available when compiling new data (i.e. if this analysis is
performed de novo ). Comparison of the results in Table V with those in Table IV indicates
that the toxicity of the compounds not included in the training and test sets is better predicted
than that for those selected by KNN for the test set (higher r 2 and r 2*). The comparison of
the slopes in the regression through the origin shows that in all the models for the remaining
compounds, the Slope* is lower, which indicates that for those compounds the toxicity will
be over-predicted. However, this may be the more desirable situation, especially for the risk
assessment.
It was evident from the model parameters and validation statistics that there is no
significant improvement (either in statistical fit or predictivity) between the models based on
15–50 derivatives as selected by DOE. However, below 15 compounds in the training set, the
noise, or error in both the toxicological data and chemical descriptors has a bigger influence
on the quality of the model.
In conclusion, training sets for QSAR development should be selected so as to make best
use of diversity, such as by the DOE methodology, so as to optimize applicability. Moreover,
diversity-based selection of the training set ensures that validation will be a process of
interpolation. It is desirable that the selection of validation chemicals is based on the
representivity of the training set since methods, such as KNN, ensure that the validation
chemicals mimic the distribution of the chemicals in the training set within the descriptor
space. Moreover, a training to test ratio of 1: 3 appears adequate (and indeed rigorous) for
validation. This study reveals that with appropriate selection of chemicals it is possible to
maximize the coverage of the descriptor space with a relatively small number of chemicals to
train and validate a high quality QSAR. Specifically to this study, a chemical universe of 385
compounds was represented by 60 chemicals (15 for training and 45 for validation). This
approach thus reduces the time and resources required for testing without reducing the
quality of the QSAR. Despite the fact that the methodology described in this investigation
was based upon a two-descriptor regression model, we believe that it is also applicable to
more multivariate QSAR analyses. From the findings in this paper, as a conservative
recommendation (and building upon the original rule of thumb of Topliss and Costello [10]),
TABLE V Statistics of the relationship between observed and calculated toxicity, with intercept and through theorigin for the derivatives, which do not participate in either the training or test sets
TOXðobsÞ ¼ a £ TOXðcalcÞ þ b TOXðobsÞ ¼ a £ TOXðcalcÞ
Ntest r 2 Slope Intercept r 2* Slope*
345 0.866 0.899 20.088 0.918 0.836325 0.854 0.929 20.091 0.910 0.859305 0.861 0.950 20.084 0.914 0.883285 0.847 0.983 20.080 0.905 0.913265 0.851 0.906 20.083 0.912 0.832245 0.866 0.934 20.084 0.916 0.859225 0.860 0.933 20.044 0.912 0.891205 0.868 0.950 20.076 0.916 0.876185 0.848 0.944 20.064 0.909 0.880
CHEMICAL SELECTION IN QSARs 79
Dow
nloa
ded
by [
Uni
vers
ity o
f C
alif
orni
a Sa
nta
Cru
z] a
t 15:
56 2
4 N
ovem
ber
2014
it is suggested that there should be a minimum of 10 observations for each variable in a
QSAR. Further investigations are required to confirm that this finding extends to other
endpoints (both toxicological and pharmacological) and to other modeling techniques such as
PLS and neural networks.
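
The two regressions summarised in Table V (observed against calculated toxicity, with an intercept and through the origin) can be computed for any pair of observed and calculated values. The following Python sketch is illustrative only: the function names and the synthetic data are assumptions and are not taken from this study. In particular, r2* is computed here with the uncentered total sum of squares, a common convention for models without an intercept that may differ from the definition used to produce Table V.

```python
import numpy as np


def fit_with_intercept(calc, obs):
    """Least-squares fit of obs = a * calc + b; returns (a, b, r2)."""
    calc = np.asarray(calc, dtype=float)
    obs = np.asarray(obs, dtype=float)
    a, b = np.polyfit(calc, obs, 1)           # slope and intercept
    resid = obs - (a * calc + b)
    centred = obs - obs.mean()
    r2 = 1.0 - (resid @ resid) / (centred @ centred)
    return a, b, r2


def fit_through_origin(calc, obs):
    """Least-squares fit of obs = a * calc; returns (a, r2_star).

    ASSUMPTION: r2_star uses the uncentered total sum of squares
    (sum of obs**2), not the centred one used in fit_with_intercept.
    """
    calc = np.asarray(calc, dtype=float)
    obs = np.asarray(obs, dtype=float)
    a = (calc @ obs) / (calc @ calc)          # closed-form no-intercept slope
    resid = obs - a * calc
    r2_star = 1.0 - (resid @ resid) / (obs @ obs)
    return a, r2_star


# Synthetic stand-in values (NOT the Tetrahymena toxicity data).
rng = np.random.default_rng(0)
calc = rng.uniform(-1.0, 2.5, size=345)
obs = 0.9 * calc - 0.09 + rng.normal(0.0, 0.3, size=345)

a, b, r2 = fit_with_intercept(calc, obs)
a0, r2_star = fit_through_origin(calc, obs)
print(f"with intercept: r2={r2:.3f} slope={a:.3f} intercept={b:.3f}")
print(f"through origin: r2*={r2_star:.3f} slope*={a0:.3f}")
```

Under this convention the uncentered sum of squares is never smaller than the centred one, so r2* can exceed r2 even when the through-origin fit leaves larger residuals; the two coefficients are therefore not directly comparable.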
Acknowledgements
Toxicity data acquisitions were supported in part by a grant from The University of
Tennessee Center of Excellence in Livestock Disease and Human Health. The European
Union IMAGETOX Research Training Network (HPRN-CT-1999-00015) supported
Dr Netzeva. Gratitude is expressed to Mr Glendon Sinks for his assistance with toxicity
analyses.
References
[1] Worth, A.P. and Balls, M. (2002) “Alternative (non-animal) methods for chemical testing: current status and future aspects”, ATLA 30(suppl. 1), 1–125.
[2] Walker, J.D. and Schultz, T.W. (2002) “Structure activity relationships for predicting ecological effects of chemicals”, In: Hoffman, D.J., Rattner, B.A., Burton, Jr., G.A. and Cairns, Jr., J., eds, Handbook of Ecotoxicology 2nd Ed. (CRC Press, Boca Raton), (in press).
[3] Schultz, T.W., Cronin, M.T.D., Walker, J.D. and Aptula, A.O. (2003) “Quantitative structure–activity relationships (QSARs) in toxicology: a historical perspective”, J. Mol. Struct. (Theochem), (in press).
[4] Karabunarliev, S., Mekenyan, O.G., Karcher, W., Russom, C.L. and Bradbury, S.P. (1996) “Quantum-chemical descriptors for estimating the acute toxicity of substituted benzenes to the guppy (Poecilia reticulata) and fathead minnow (Pimephales promelas)”, Quant. Struct.–Act. Relat. 15, 311–320.
[5] Schultz, T.W. (1999) “Structure–toxicity relationships for benzenes evaluated with Tetrahymena pyriformis”, Chem. Res. Toxicol. 12, 1262–1267.
[6] Cronin, M.T.D. and Schultz, T.W. (2001) “Development of quantitative structure–activity relationships for the toxicity of aromatic compounds to Tetrahymena pyriformis: Comparative assessment of methodologies”, Chem. Res. Toxicol. 14, 1284–1295.
[7] Seward, J.R., Cronin, M.T.D. and Schultz, T.W. (2001) “Structure–toxicity analyses of Tetrahymena pyriformis exposed to pyridine – an examination into extension of surface-response domains”, SAR QSAR Environ. Res. 11, 489–512.
[8] Schultz, T.W. and Cronin, M.T.D. (2003) “Essential and desirable characteristics of ecotoxicity QSARs”, Environ. Toxicol. Chem., (in press).
[9] Cronin, M.T.D., Aptula, A.O., Duffy, J.C., Netzeva, T.I., Rowe, P.H. and Valkova, I.V. (2002) “Comparative assessment of methods to develop QSARs for the prediction of the toxicity of phenols to Tetrahymena pyriformis”, Chemosphere 49, 1201–1221.
[10] Topliss, J.G. and Costello, J.D. (1972) “Chance correlations in structure–activity studies using multiple regression analysis”, J. Med. Chem. 15, 1066–1069.
[11] Liu, R., Sun, H. and So, S.-S. (2001) “Development of quantitative structure–property relationships models for early ADME evaluation in drug discovery 2. Blood–brain barrier penetration”, J. Chem. Inf. Comput. Sci. 41, 1623–1632.
[12] Cronin, M.T.D., Aptula, A.O., Dearden, J.C., Duffy, J.C., Netzeva, T.I., Patel, H., Rowe, P.H., Schultz, T.W., Worth, A.P., Voutzoulidis, K. and Schuurmann, G. (2002) “Structure-based classification of antibacterial activity”, J. Chem. Inf. Comput. Sci. 42, 869–878.
[13] Aptula, A.O., Netzeva, T.I., Valkova, I.V., Cronin, M.T.D., Schultz, T.W., Kuhne, R. and Schuurmann, G. (2002) “Multivariate discrimination between modes of toxic action of phenols”, Quant. Struct.–Act. Relat. 21, 12–22.
[14] Eriksson, L. and Johansson, E. (1996) “Multivariate design and modeling in QSAR”, Chemom. Intell. Lab. Syst. 34, 1–19.
[15] Brown, R.D. and Martin, Y.C. (1997) “The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding”, J. Chem. Inf. Comput. Sci. 37, 1–9.
[16] Matter, H. (1997) “Selecting optimally diverse compound from structure databases: a validation study of two-dimensional and three dimensional molecular descriptors”, J. Med. Chem. 40, 1219–1229.
[17] Schultz, T.W. (1997) “TETRATOX: The Tetrahymena pyriformis population growth impairment endpoint. A surrogate for fish lethality”, Toxicol. Methods 7, 289–309.
[18] Bradbury, S.P., Russom, C.L., Ankley, G.T., Schultz, T.W. and Walker, J.D. (2003) “QSARs for predicting ecological effects of organic chemicals”, Environ. Toxicol. Chem., (in press).
[19] SAS (Statistical Analysis System) Institute, Inc. (1989) SAS/STAT User’s Guide, 4th Ed. Vol. 2, version 6, “SAS Institute Inc.” p 846.
[20] Kennard, R.W. and Stone, L.A. (1969) “Computer aided designs of experiments”, Technometrics 11, 137–148.
[21] Snarey, M., Terrett, N.K., Willet, P. and Wilton, D.J. (1997) “Comparison of algorithms for dissimilarity-based compound selection”, J. Mol. Graphics Mod. 15, 372–385.
[22] Johnson, R.A. and Wichern, D.W. (1998) “Clustering, distance methods, and ordination”, In: Johnson, R.A. and Wichern, D.W., eds, Applied Multivariate Statistical Analysis (Prentice Hall, New Jersey), pp 726–797.
[23] Lipnick, R.L. (1991) “Outliers: their origin and use in the classification of molecular mechanisms of toxicity”, Sci. Total Environ. 109/110, 131–153.
[24] Nendza, M. and Russom, C.L. (1991) “QSAR modeling of the ERL-D fathead minnow acute toxicity database”, Xenobiotica 21, 147–170.
[25] Konemann, H. (1981) “Quantitative structure–activity relationships in fish toxicity studies. Part I: A relationship for 50 industrial pollutants”, Toxicology 19, 209–221.
[26] Veith, G.D., Call, K. and Brooke, L. (1983) “Structure–toxicity relationships for the fathead minnow, Pimephales promelas: narcotic industrial chemicals”, Can. J. Fish. Aquat. Sci. 40, 743–748.
[27] Hansch, C., Kim, D., Leo, A.J., Novellino, E., Silipo, C. and Vittoria, A. (1989) “Toward a quantitative comparative toxicology of organic compounds”, Crit. Rev. Toxicol. 19, 185–226.
[28] Schultz, T.W. and Tichy, M. (1993) “Structure–toxicity relationships for unsaturated alcohols to Tetrahymena pyriformis: C5 and C6 analogs and primary propargylic alcohols”, Bull. Environ. Contam. Toxicol. 51, 681–688.
[29] Kaiser, K.L.E., Dearden, J.C., Klein, W. and Schultz, T.W. (1999) “A note of caution to users of ECOSAR”, Water Qual. Res. J. Can. 34, 179–182.
[30] Norinder, U. and Hogberg, T. (1992) “A quantitative structure–activity relationship for some dopamine D2 antagonists of benzamide type”, Acta. Pharm. Nord. 4, 73–78.
[31] Belvisi, L., Bravi, G., Catalano, G., Mabilia, M., Salimbeni, A. and Scolastico, C. (1996) “A 3D QSAR CoMFA study of non-peptide angiotensin II receptor antagonists”, J. Comput.-Aided Mol. Des. 10, 567–582.
[32] Blaha, L., Damborsky, J. and Nemec, M. (1998) “QSAR for acute toxicity of saturated and unsaturated halogenated aliphatic compounds”, Chemosphere 36, 1345–1365.
[33] Harju, M., Andersson, P.L., Haglund, P. and Tysklind, M. (2002) “Multivariate physicochemical characterization and quantitative structure–property relationship modeling of polybrominated diphenyl ethers”, Chemosphere 47, 375–384.
[34] Golbraikh, A. and Tropsha, A. (2002) “Predictive QSAR modeling based on diversity sampling of experimental datasets for the test and training test selection”, J. Comput.-Aided Mol. Des., (in press).
[35] Golbraikh, A. and Tropsha, A. (2002) “Beware of q2!”, J. Mol. Graphics Mod. 20, 269–276.