Selection of data sets for QSARs: analyses of Tetrahymena toxicity from aromatic compounds


SAR and QSAR in Environmental Research
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/gsar20

Selection of data sets for QSARs: Analyses of Tetrahymena toxicity from aromatic compounds
T.W. Schultz a, T.I. Netzeva b & M.T.D. Cronin b
a College of Veterinary Medicine, The University of Tennessee, 2407 River Drive, Knoxville, TN 379961-4500, USA
b School of Pharmacy and Chemistry, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
Published online: 29 Oct 2010.

To cite this article: T.W. Schultz, T.I. Netzeva & M.T.D. Cronin (2003) Selection of data sets for QSARs: Analyses of Tetrahymena toxicity from aromatic compounds, SAR and QSAR in Environmental Research, 14:1, 59-81, DOI: 10.1080/1062936021000058782

To link to this article: http://dx.doi.org/10.1080/1062936021000058782


Invited Paper

SELECTION OF DATA SETS FOR QSARs: ANALYSES OF TETRAHYMENA TOXICITY FROM AROMATIC COMPOUNDS

T.W. SCHULTZa,*, T.I. NETZEVAb and M.T.D. CRONINb

a The University of Tennessee, College of Veterinary Medicine, 2407 River Drive, Knoxville, TN 379961-4500 USA; b School of Pharmacy and Chemistry, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK

(Received 19 September 2002; In final form 5 October 2002)

The aim of this investigation was to develop a strategy for the formulation of a valid ecotoxicological-based QSAR while, at the same time, minimizing the required number of toxicological data points. Two chemical selection approaches, distance-based optimality and K Nearest Neighbor (KNN), were used to examine the impact of the number of compounds used in the training and testing phases of QSAR development (i.e. diversity and representivity, respectively) on the predictivity (i.e. external validation) of the QSAR. Regression-based QSARs for the ecotoxic potency for population growth impairment of aromatic compounds (benzenes) to the aquatic ciliate Tetrahymena pyriformis were developed based on descriptors for chemical hydrophobicity and electrophilicity. A ratio of one compound in the training set to three in the test set was applied. The results indicate that from a known chemical universe, in this case 385 derivatives, robust QSARs of equal quality may be developed from a small number of diverse compounds, validated by a representative test set. As a conservative recommendation it is suggested that there should be a minimum of 10 observations for each variable in a QSAR.

Keywords: Experimental design; Validation; Response-surface QSAR; Tetrahymena toxicity

INTRODUCTION

There is increasing interest in the use of toxicological-based quantitative structure–activity relationships (QSARs) as non-animal methods to provide data for priority setting, risk assessment and chemical classification and labeling [1]. These uses will require development of new QSARs and validation of the new, as well as existing, QSARs. Both development and validation efforts may require additional toxicological testing; by conducting preliminary evaluations prior to the chemical selection process, it may be possible to maximize the usefulness of the resulting QSAR while, at the same time, minimizing the number of toxicity tests.

Ecotoxicity QSARs, such as those developed for aquatic endpoints, are potency-based [2]. They use log-based continuous toxicological data (e.g. LC50 or EC50) and molecular descriptor data (e.g. physicochemical and quantum chemical parameters). These data are linked together by a statistical method such as regression analysis [3]. Recent work [4–7] has shown that, as a general tenet, the toxic potency of aromatic molecules for aquatic toxicity endpoints can be modeled by two factors, hydrophobicity and stereo-electronic effects (i.e. reactivity, steric hindrance), in fundamental and empirical regression models. The generic model:

$$\log(C^{-1}) = a\,(\mathrm{penetration}) + b\,(\mathrm{interaction}) + c \qquad (1)$$

describes such QSARs, the objective of which is to predict the toxic potency of untested compounds as accurately as possible. The ability to meet this objective is, in large part, a function of the datasets on which the QSAR is trained and tested (i.e. validated).

ISSN 1062-936X print/ISSN 1029-046X online © 2003 Taylor & Francis Ltd

*Corresponding author. Tel.: +1-865-974-5826. Fax: +1-865-974-2215. E-mail: [email protected]

The number and structural heterogeneity of compounds used to train and validate a QSAR define the domain of that model and thus affect its applicability. Since the use of a particular QSAR is only valid within its particular domain [8], defining the depth and breadth of that domain is of great importance. To enhance the applicability of a QSAR, efforts should be made to optimize the molecular heterogeneity of the tested chemicals and thus maximize the descriptor domain. However, due to time and resource limitations, the generation of bigger and more varied data sets may not be the most efficient approach. At the same time, subtle alterations in molecular structure can lead to dramatic changes in the mechanism of toxic action and in potency. One of the best examples of this is the case of hydroxylated aromatic compounds such as catechol and hydroquinone, whose toxicity is not well modeled by general QSARs for hydroxylated aromatics [9]. In each case toxicity is underestimated due to their propensity to tautomerize to a reactive semiquinone moiety [9]. Therefore, testing a large number of chemicals exhibiting a gradual change in structure makes it less likely that an unforeseen misapplication of the QSAR will occur. The net result is that, throughout the development and validation of QSARs, one is forced to balance the inherent benefit a larger database provides against the cost and time required to generate the extra data.

While it is advisable to have as many observations (i.e. chemicals) in a QSAR as possible, certain statistical criteria must be met. The so-called rule of Topliss and Costello [10] states that a minimum of 5 chemicals is required for the inclusion of each descriptor in a QSAR. Therefore, for the generic QSAR in Eq. (1), which has two descriptors, the training set requires at minimum 10 compounds.

Validating a QSAR with external data (i.e. predicting the toxicity of compounds that have not been included in the initial model), while the most demanding, is the best method of validation. However, there is no consensus on a ratio between training and testing chemicals. For databases where experimental design has not been applied, the number of chemicals in the test set varies from a small number of available compounds [5,11], to one third [12] or one half [13] of the whole database. In the selection and structure of the training set, optimal experimental design techniques undoubtedly offer further reduction of the number of compounds required [14]. At the less conservative end of the spectrum, Brown and Martin [15] suggested a ratio of 20% of compounds for training a model and 80% for testing it; Matter [16] suggested 35% for training and 65% for testing. These ratios imply that a 1:3 training-to-testing ratio is an adequate and quite acceptable starting point for investigations in this area. Therefore, if one were to develop a QSAR using a training set of 10 compounds, it would require data on an additional 30 chemicals for the validation exercise. The issue then becomes how to select these 40 chemicals.
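The arithmetic behind this 40-compound minimum can be written out as a short sketch. The function name and its default arguments are illustrative assumptions, not part of the original study; the defaults simply encode the five-compounds-per-descriptor rule [10] and the 1:3 training-to-testing ratio discussed above.

```python
def minimum_dataset_size(n_descriptors, per_descriptor=5, test_to_train=3):
    """Return (training, test, total) compound counts.

    per_descriptor: minimum compounds per descriptor (Topliss-Costello rule).
    test_to_train:  size of the test set relative to the training set (1:3 ratio).
    """
    n_train = per_descriptor * n_descriptors
    n_test = test_to_train * n_train
    return n_train, n_test, n_train + n_test


# Two descriptors (one for penetration, one for interaction), as in Eq. (1).
print(minimum_dataset_size(2))
```

Setting `per_descriptor=10` gives the more conservative 20-compound training set implied by the recommendation in the abstract.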

For the development and validation of QSARs, regardless of the selection process used, efforts should be made to ensure the chemicals in the validation set are similar to those of the training set. Further, it must be ensured that the training set is diverse. While there are a number of methods to select chemicals for training, and the subsequent validation, of a QSAR, two of the more obvious techniques are to (1) have the tested chemicals represent the breadth, or variety, of all existing chemicals within that domain (i.e. diversity); and (2) have the tested chemicals represent the depth, or distribution, of all existing chemicals within that domain (i.e. representivity). In the former case the chemicals would be selected to maximize the coverage of the descriptor space with as few chemicals as possible. In the latter case the chemicals would be selected to mimic the distribution of the existing chemicals within the descriptor space. These two processes may be considered as being dissimilarity- and similarity-based selection, respectively.

While the above example leads to a minimum requirement of data for 40 chemicals, it is not to be implied that testing 40 chemicals would be sufficient to cover the (toxicological) domain encompassed by all aromatic derivatives. Accepting the hypothesis that increasing the number of observations in a data set (n) will not increase the information it contains ad infinitum, it is therefore important to ascertain whether a significantly larger data set, for example 200 chemicals (i.e. 50 in the training set and 150 in the test set), would provide more information, and add further value, than a set of 60 chemicals (i.e. 15 in the training set and 45 in the test set).

The purpose of the present investigation was to develop a strategy for the formulation of a valid ecotoxicological-based QSAR while, at the same time, minimizing the required number of toxicological data points. Two chemical selection approaches, distance-based optimality and K nearest neighbor (KNN), were used to examine the impact of the number of compounds used in the training and testing phases of QSAR development (i.e. diversity and representivity, respectively) on the predictivity (i.e. external validation) of the QSAR. To this end, regression-based QSARs for the ecotoxic potency for population growth impairment to the aquatic ciliate Tetrahymena pyriformis were developed according to the approach of Schultz [5] (i.e. development of models based on descriptors for chemical hydrophobicity and electrophilicity). The T. pyriformis population growth inhibition database [17] is considered to be of high quality [18]. The database has been compiled for QSAR development and validation and includes a wide variety of substituted benzenes.

MATERIALS AND METHODS

Chemicals Tested

Data for 385 commercially obtainable (Aldrich Chemical Co., Milwaukee, Wisconsin, USA; MTM Research Chemicals or Lancaster Synthesis Inc., Windham, New Hampshire, USA) substituted benzenes with purity ≥ 95% and representing a variety of classes and mechanisms of toxic action were included in the chemical selection analyses. Derivatives were confined to a defined and chemically heterogeneous domain. Specific substructures not included in these evaluations were carboxylic acids, compounds having the ability to tautomerize (e.g. catechols) and benzoquinones.

Toxicological Assessment

Toxicity data ($\log(\mathrm{IGC}_{50}^{-1})$) were determined in the population growth impairment assay utilizing T. pyriformis (strain GL-C). Assays were conducted following the protocol described by Schultz [17] with a 40-h static design and population density measured spectrophotometrically as the endpoint. The test protocol allows for 8–9 cell cycles in controls. Following range finding, each chemical was tested in three replicate evaluations. Two controls were used to provide a measure of the acceptability of the test by indicating the suitability of the medium and test conditions, as well as a basis for interpreting data from other treatment regimes. The first control had no test benzene but was inoculated with T. pyriformis. The other, a blank, had neither test substance nor inoculum. Each test replicate consisted of 6–10 different concentrations of each test material with duplicate flasks of each concentration. Only replicates with control-absorbency values > 0.60 but < 0.90 were used in the analyses. The 50% growth inhibitory concentration (IGC50) was determined by Probit Analysis of Statistical Analysis System (SAS) software [19] for each benzene evaluated, with the Y-values being absorbencies normalized as percentage of control and the X-values being the toxicant concentrations in mg/l.

Chemical Descriptors

Hydrophobicity was quantified by the logarithm of the 1-octanol/water partition coefficient (log Kow). The hydrophobicity values were measured or estimated by the ClogP for Windows version 1.0.0 software (BIOBYTE Corp., Claremont, CA, USA). The acceptor superdelocalizabilities were determined as a sum of the ratios between the squared eigenvectors (coefficients) of the i-th atomic orbital in the j-th unoccupied molecular orbital and the eigenvalue (energy) of the j-th unoccupied molecular orbital, multiplied by two. The calculations were performed using the AM1 method implemented in MOPAC 93. The maximum acceptor superdelocalizabilities (Amax) were extracted via in-house macros in MS Word and Excel.
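The verbal definition of the acceptor superdelocalizability can be illustrated with a small numerical sketch. It is not the in-house macro procedure used in the study. The function name, the array layout, the grouping of atomic orbitals by atom (`ao_to_atom`), and the toy numbers are all assumptions made for illustration; sign and unit conventions for the orbital energies are taken as given by the input.

```python
import numpy as np


def max_acceptor_superdelocalizability(coeffs, energies, n_occupied, ao_to_atom):
    """Return (A_max, per-atom values) from molecular-orbital data.

    coeffs[i, j]  : coefficient of atomic orbital i in molecular orbital j
    energies[j]   : eigenvalue of molecular orbital j, sorted ascending
    n_occupied    : number of occupied molecular orbitals
    ao_to_atom[i] : index of the atom carrying atomic orbital i (assumed grouping)
    """
    coeffs = np.asarray(coeffs, dtype=float)
    energies = np.asarray(energies, dtype=float)
    c_unocc = coeffs[:, n_occupied:]      # unoccupied orbitals only
    e_unocc = energies[n_occupied:]
    # 2 * sum over unoccupied orbitals j of c_ij^2 / e_j, for each atomic orbital i
    per_ao = 2.0 * np.sum(c_unocc ** 2 / e_unocc, axis=1)
    # Collect atomic-orbital contributions on their parent atoms
    per_atom = np.bincount(np.asarray(ao_to_atom), weights=per_ao)
    return per_atom.max(), per_atom


# Toy two-orbital system: one occupied and one unoccupied molecular orbital.
a_max, per_atom = max_acceptor_superdelocalizability(
    coeffs=[[0.6, 0.8], [0.8, -0.6]],
    energies=[-10.0, 0.5],
    n_occupied=1,
    ao_to_atom=[0, 1],
)
print(a_max, per_atom)
```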

Statistical Analyses

The number of observations in the training sets for QSAR development was varied from 10 to 50 in increments of 5. Compounds were selected for inclusion in the training set by a dissimilarity-based procedure, distance-based optimality. Selection of compounds using this criterion aims to spread the design points (i.e. the chemicals for the training set) uniformly over the design space (i.e. the chemical universe) without the development of a model in advance. Hence, diversity is maximized by this approach. The distance-based optimality algorithm was implemented in the MINITAB (ver. 13.1) software as one of the procedures for design of optimal experiments (DOE).

For the dissimilarity-based selection of optimal subsets, a common algorithm introduced by Kennard and Stone [20] was used. Following Snarey et al. [21], it included: (1) initialization of the subset by transferring a compound from the database; (2) calculation of the dissimilarity between each of the remaining compounds in the database and the compounds in the subset; (3) selection of the compound in the database that is most dissimilar to the subset and its transfer to the subset; and (4) return to Step 2 if there are fewer than a specified number of compounds in the subset.

The distance-based optimality algorithm selected design points (i.e. chemicals) from a candidate set, such that the points were spread evenly over the hydrophobicity/electrophilicity plane (log Kow and Amax were standardized to the range 0–1 before analysis). The optimization started in each case from all 385 compounds in the database. The initial compound (Step 1 above) was taken to be the compound with the largest Euclidean distance from the origin. Then, in Step 3, additional design points were added in a stepwise manner such that each new point was as far as possible from the points already selected. Selecting the training set in this manner ensures that the validation exercise (see below) is performed as a process of interpolation and not extrapolation.
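The stepwise selection just described can be sketched in a few lines. This is not the MINITAB implementation: the function names are hypothetical, and reading "as far as possible from the points already selected" as maximizing the distance to the nearest selected point (a maximin rule) is an assumption.

```python
import numpy as np


def min_max_scale(X):
    """Standardize each descriptor column to the range 0-1."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)


def maximin_select(X, n_select):
    """Pick n_select rows of X (columns: log Kow, Amax) spread over the plane."""
    Xs = min_max_scale(X)
    # Step 1: start from the compound farthest from the origin.
    first = int(np.argmax(np.linalg.norm(Xs, axis=1)))
    selected = [first]
    # Step 2: distance from every compound to its nearest selected compound.
    d_min = np.linalg.norm(Xs - Xs[first], axis=1)
    while len(selected) < n_select:
        # Step 3: take the compound most dissimilar to the current subset.
        nxt = int(np.argmax(d_min))
        selected.append(nxt)
        # Step 4: update the nearest-selected distances and repeat.
        d_min = np.minimum(d_min, np.linalg.norm(Xs - Xs[nxt], axis=1))
    return selected


# Toy candidate set: four corners and the centre of the unit square.
candidates = [[0, 0], [1, 1], [0, 1], [1, 0], [0.5, 0.5]]
print(maximin_select(candidates, 4))
```

On the toy set the four corners are chosen before the centre point, which illustrates why a training set built this way encloses the remaining compounds.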

The number of observations in the validation set for each of the models was three times the number of observations in the corresponding training set (i.e. varied by increments of 15 from 30 to 150). The compounds in the validation set were selected by the K-nearest neighbor (KNN) technique in the same hydrophobicity/electrophilicity plane (standardized to the range 0–1). K-means clustering classifies observations into groups when the groups are initially unknown by the use of an initial partition column (here, the compounds selected by DOE). The clusters were formed around the pre-selected compounds, each pre-selected compound being typical of its respective cluster. Thus, the test set was optimized to mimic the training set. The non-hierarchical KNN clustering technique (according to Ref. [22]) was implemented in the MINITAB (ver. 13.1) software.
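The clustering step can be illustrated with a minimal sketch of K-means seeded with pre-selected compounds, which is what the passage describes under the KNN label. It is not the MINITAB routine; the function name and toy data are assumptions. The text does not specify how the test compounds were then drawn from each cluster, so the sketch stops at the cluster assignment.

```python
import numpy as np


def seeded_kmeans(X, seed_idx, n_iter=100):
    """K-means whose initial centroids are the rows of X listed in seed_idx.

    X is assumed to hold the descriptors already standardized to 0-1.
    Returns (labels, centroids).
    """
    X = np.asarray(X, dtype=float)
    centroids = X[list(seed_idx)].copy()
    labels = None
    for _ in range(n_iter):
        # Distance of every compound to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Toy plane with two obvious groups; compounds 0 and 3 play the role of
# the DOE-selected training compounds.
points = [[0, 0], [0.1, 0], [0, 0.1], [1, 1], [0.9, 1], [1, 0.9]]
labels, centroids = seeded_kmeans(points, seed_idx=[0, 3])
print(labels.tolist())
```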

QSARs were developed using the regression procedures of MINITAB version 13.0. Regression through the origin (as part of the assessment of the predictivity of the test set) was performed using the SPSS (version 10.0.5) software. In this case, $r^2$ measures the proportion of the variability in the dependent variable about the origin explained by regression and should not be compared to $r^2$ for models that include an intercept. The $\log(\mathrm{IGC}_{50}^{-1})$ values, with IGC50 reported in mM, were used as the dependent variable. For the QSARs, log Kow and Amax acted as the independent variables. Resulting models were measured for fit by the coefficient of determination adjusted for the degrees of freedom ($r^2_{\mathrm{adj}}$). The uncertainty in the model was noted as the square root of the mean square for error (s), while the predictivity of the model was noted as the cross-validated $r^2$ ($r^2_{\mathrm{CV}}$) determined by the leave-one-out method. Outliers were identified as benzenes with a standardized residual > 3 [23].
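The fit statistics named above can be reproduced for any two-descriptor data set with ordinary least squares. The sketch below is not the MINITAB or SPSS procedure. The function name and the synthetic data are assumptions, and the leave-one-out statistic is computed with the standard leverage shortcut rather than by refitting the model n times.

```python
import numpy as np


def fit_qsar(X, y):
    """Least-squares fit of y = X @ b + c with the summary statistics used here."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([X, np.ones(n)])          # descriptors plus intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    dof = n - p - 1
    s = np.sqrt(sse / dof)                        # root mean square error
    r2_adj = 1.0 - (sse / dof) / (sst / (n - 1))
    h = np.diag(A @ np.linalg.pinv(A))            # leverages (hat-matrix diagonal)
    press = np.sum((resid / (1.0 - h)) ** 2)      # leave-one-out prediction errors
    r2_cv = 1.0 - press / sst
    std_resid = resid / (s * np.sqrt(1.0 - h))    # standardized residuals
    return {"coef": coef, "s": s, "r2_adj": r2_adj, "r2_cv": r2_cv,
            "outliers": np.flatnonzero(np.abs(std_resid) > 3),
            "std_resid": std_resid}


# Synthetic example: 40 "compounds", two descriptors, small random error.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(40, 2))
y_demo = 0.5 * X_demo[:, 0] + 2.0 * X_demo[:, 1] - 1.0 + rng.normal(scale=0.01, size=40)
stats = fit_qsar(X_demo, y_demo)
print(stats["coef"], stats["r2_adj"], stats["r2_cv"], stats["s"])
```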

RESULTS

The data for hydrophobicity (log Kow) and electrophilicity (Amax), along with toxicity values ($\log(\mathrm{IGC}_{50}^{-1})$), are reported in Table I. Hydrophobicity varied over about six orders of magnitude (from −0.55 to 5.76 on a log scale). Reactivity measured by Amax varied on a linear scale from 0.280 to 0.385. Toxicity values varied uniformly over about four orders of magnitude (from −1.13 to 2.82 on a log scale). Least-squares regression analysis of these data, based on the two physico-chemical descriptors, yielded the equation (the standard errors of the coefficients are in parentheses):

$$\log(\mathrm{IGC}_{50}^{-1}) = 0.545(0.015)\,(\log K_{ow}) + 16.21(0.62)\,(A_{max}) - 5.91(0.20)$$

$$n = 385;\; r^2_{\mathrm{adj}} = 0.859;\; r^2_{\mathrm{CV}} = 0.856;\; s = 0.274;\; F = 1167;\; \Pr > F = 0.0001 \qquad (2)$$

The QSAR in Eq. (2) was considered as the reference model since it contained all the toxicological information of the database. A plot of observed toxicity versus that predicted by Eq. (2) is presented in Fig. 1.
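As a worked illustration of how Eq. (2) is applied, the sketch below evaluates it for the first compound of Table I (3-aminobenzyl alcohol: log Kow = −0.55, Amax = 0.293, observed toxicity −1.13). The function name is an illustrative assumption; the coefficients are those of Eq. (2).

```python
def predict_log_inv_igc50(log_kow, a_max):
    """Predicted log(IGC50^-1) from the reference QSAR, Eq. (2)."""
    return 0.545 * log_kow + 16.21 * a_max - 5.91


observed = -1.13                                   # Table I, compound 1
predicted = predict_log_inv_igc50(-0.55, 0.293)
print(round(predicted, 2), round(observed - predicted, 2))
```

The prediction is about −1.46, so the residual of about 0.33 log units is slightly larger than the standard error of the model (s = 0.274).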

A plot of the hydrophobicity (log Kow) and electrophilicity (Amax) for the complete set of benzene derivatives for which toxicological data are available in this study is presented in Fig. 2. From the complete database, training sets varying in number from 10 to 50 derivatives, and test sets of three times the respective number, were selected. To illustrate the selection procedure, the chemicals selected for the training set of 15 compounds, and the test set of 45, are highlighted in Fig. 2.

Table II reports the diversity/representivity criteria for each of the training sets, selected by DOE, and test sets, selected by KNN, respectively. These results indicate that all test sets were within the physico-chemical space of the respective training set.

The QSARs and summary statistics developed from each of the 9 training sets of varying number are reported in Table III. In each case, the values of the coefficients on the descriptors and intercept are similar to those reported in Eq. (2). For the respective test, or validation, sets, the relationship between toxicity predicted by the relevant QSAR and that observed is



TABLE I  Name, CAS number, toxicity and physico-chemical descriptors of the compounds in the dataset

No.  CAS          Name                                    log(IGC50^-1)  log Kow  Amax   Equations*
1    1877-77-6    3-Aminobenzyl alcohol                   -1.13          -0.55    0.293  (3)
2    5344-90-1    2-Aminobenzyl alcohol                   -1.07          -0.17    0.296
3    100-51-6     Benzyl alcohol                          -0.83           1.05    0.285  (5)
4    501-94-0     4-Hydroxyphenethyl alcohol              -0.83           0.52    0.303
5    3544-25-0    4-Aminobenzyl cyanide                   -0.76           0.34    0.302
6    610-15-1     2-Nitrobenzamide                        -0.72          -0.12    0.332  (3)
7    498-00-0     4-Hydroxy-3-methoxybenzyl alcohol       -0.70           0.29    0.300  (7)
8    90-04-0      2-Methoxyaniline                        -0.69           1.18    0.295
9    98-85-1      (sec)-Phenethyl alcohol                 -0.66           1.42    0.285
10   108-46-3     1,3-Dihydroxybenzene                    -0.65           0.80    0.307
11   14898-87-4   1-Phenyl-2-propanol                     -0.62           1.97    0.283
12   60-12-8      Phenethyl alcohol                       -0.59           1.36    0.286
13   617-94-7     2-Phenyl-2-propanol                     -0.57           1.81    0.288
14   53222-92-7   3-Amino-2-cresol                        -0.55           0.70    0.301
15   90-72-2      2,4,6-Tris-(Dimethylaminomethyl)phenol  -0.52           0.92    0.302
16   589-18-4     4-Methylbenzyl alcohol                  -0.49           1.58    0.284
17   937-39-3     Phenylacetic acid hydrazide             -0.48           0.14    0.319
18   2237-30-1    3-Cyanoaniline                          -0.47           1.07    0.306  (3)
19   98-86-2      Acetophenone                            -0.46           1.63    0.318
20   89-95-2      2-Methylbenzyl alcohol                  -0.43           1.55    0.285
21   93-54-9      (±)1-Phenyl-1-propanol                  -0.43           1.94    0.286
22   87-59-2      2,3-Dimethylaniline                     -0.43           1.81    0.294
23   87-62-7      2,6-Dimethylaniline                     -0.43           1.84    0.294
24   100-86-7     2-Methyl-1-phenyl-2-propanol            -0.41           1.86    0.285
25   589-08-2     N-Methylphenethylamine                  -0.41           1.43    0.285
26   1123-85-9    2-Phenyl-1-propanol                     -0.40           1.58    0.286
27   456-47-3     3-Fluorobenzyl alcohol                  -0.39           1.25    0.305
28   14191-95-8   4-Hydroxybenzyl cyanide                 -0.38           0.90    0.309
29   3034-34-2    4-Cyanobenzamide                        -0.38           0.48    0.322  (7)
30   348-54-9     2-Fluoroaniline                         -0.37           1.26    0.302
31   108-69-0     3,5-Dimethylaniline                     -0.36           1.91    0.293
32   140-29-4     Benzyl cyanide                          -0.36           1.56    0.294
33   108-95-2     Phenol                                  -0.35           1.50    0.301
34   150-19-6     3-Methoxyphenol                         -0.33           1.58    0.305
35   95-78-3      2,5-Dimethylaniline                     -0.33           1.83    0.294
36   95-48-7      2-Methylphenol                          -0.29           1.98    0.301
37   95-68-1      2,4-Dimethylaniline                     -0.29           1.68    0.293



TABLE I (continued)

No.  CAS          Name                                    log(IGC50^-1)  log Kow  Amax   Equations*
38   108-44-1     3-Methylaniline                         -0.28           1.40    0.294
39   582-22-9     β-Methylphenethylamine                  -0.28           1.68    0.285  (11)
40   699-02-5     4-Methylphenethyl alcohol               -0.26           1.68    0.285
41   100-46-9     Benzylamine                             -0.24           1.09    0.284
42   529-19-1     2-Tolunitrile                           -0.24           2.21    0.302
43   587-03-1     3-Methylbenzyl alcohol                  -0.24           1.60    0.285
44   62-53-3      Aniline                                 -0.23           0.90    0.295  (10)
45   578-54-1     2-Ethylaniline                          -0.22           1.74    0.295  (7)
46   619-25-0     3-Nitrobenzyl alcohol                   -0.22           1.21    0.315
47   122-97-4     3-Phenyl-1-propanol                     -0.21           1.88    0.285
48   100-52-7     Benzaldehyde                            -0.20           1.48    0.317
49   127-66-2     2-Phenyl-3-butyn-2-ol                   -0.18           1.88    0.300
50   618-36-0     1-Phenylethylamine                      -0.18           1.40    0.285
51   95-51-2      2-Chloroaniline                         -0.17           1.88    0.304
52   120055-09-6  1-Phenyl-2-butanol                      -0.16           2.02    0.284
53   95-64-7      3,4-Dimethylaniline                     -0.16           1.86    0.293
54   95-53-4      2-Methylaniline                         -0.16           1.43    0.294
55   106-44-5     4-Methylphenol                          -0.16           1.97    0.300
56   645-59-0     3-Phenylpropionitrile                   -0.16           1.72    0.294
57   621-42-1     3-Acetamidophenol                       -0.16           0.73    0.322
58   150-76-5     4-Methoxyphenol                         -0.14           1.34    0.298
59   103-73-1     Phenetole                               -0.14           2.51    0.300
60   621-59-0     3-Hydroxy-4-methoxybenzaldehyde         -0.14           0.97    0.317  (10)
61   108-90-7     Chlorobenzene                           -0.13           2.84    0.311
62   71-43-2      Benzene                                 -0.12           2.13    0.280  (3)
63   89104-46-1   2-Phenyl-1-butanol                      -0.11           2.11    0.288
64   622-32-2     Benzaldoxime                            -0.11           1.75    0.291
65   100-66-3     Anisole                                 -0.10           2.11    0.300
66   372-19-0     3-Fluoroaniline                         -0.10           1.30    0.307
67   4460-86-0    2,4,5-Trimethoxybenzaldehyde            -0.10           1.19    0.317
68   22135-49-5   (S±)-1-Phenyl-1-butanol                 -0.09           2.47    0.286
69   500-99-2     3,5-Dimethoxyphenol                     -0.09           1.64    0.309
70   108-39-4     3-Methylphenol                          -0.08           1.98    0.300
71   104-54-1     3-Phenyl-2-propen-1-ol                  -0.08           1.95    0.285
72   103-05-9     α,α-Dimethylbenzenepropanol             -0.07           2.42    0.285
73   93-55-0      Propiophenone                           -0.07           2.19    0.318
74   91-23-6      2-Nitroanisole                          -0.07           1.73    0.332



TABLE I (continued)

No.  CAS          Name                                    log(IGC50^-1)  log Kow  Amax   Equations*
75   106-49-0     4-Methylaniline                         -0.05           1.39    0.293
76   88-05-1      2,4,6-Trimethylaniline                  -0.05           2.31    0.293  (11)
77   3261-62-9    2-(4-Tolyl)-ethylamine                  -0.04           1.78    0.284
78   587-02-0     3-Ethylaniline                          -0.03           1.94    0.294
79   121-33-5     3-Methoxy-4-hydroxybenzaldehyde         -0.03           1.21    0.318
80   4421-08-3    4-Hydroxy-3-methoxybenzonitrile         -0.03           1.42    0.315
81   4553-07-5    Ethyl phenylcyanoacetate                -0.02           1.63    0.332
82   22144-60-1   (R±)-1-Phenyl-1-butanol                 -0.01           2.47    0.286
83   104-84-7     4-Methylbenzylamine                     -0.01           1.81    0.284
84   637-53-6     Thioacetanilide                         -0.01           1.71    0.341
85   2722-36-3    3-Phenyl-1-butanol                       0.01           2.11    0.286
86   1823-91-2    α-Methylbenzyl cyanide                   0.01           1.87    0.294
87   622-62-8     4-Ethoxyphenol                           0.01           1.81    0.298
88   121-32-4     3-Ethoxy-4-hydroxybenzaldehyde           0.02           1.58    0.317
89   371-41-5     4-Fluorophenol                           0.02           1.77    0.307  (11)
90   589-16-2     4-Ethylaniline                           0.03           1.96    0.294
91   99-09-2      3-Nitroaniline                           0.03           1.43    0.319
92   106-47-8     4-Chloroaniline                          0.05           1.83    0.302
93   1565-75-9    (±)-2-Phenyl-2-butanol                   0.06           2.34    0.288
94   100-44-7     Benzyl chloride                          0.06           2.30    0.298
95   100-61-8     N-Methylaniline                          0.06           1.66    0.295
96   768-59-2     4-Ethylbenzyl alcohol                    0.07           2.13    0.285
97   103-69-5     N-Ethylaniline                           0.07           2.16    0.295
98   108-86-1     Bromobenzene                             0.08           2.99    0.308
99   88-74-4      2-Nitroaniline                           0.08           1.85    0.332  (9)
100  1821-39-2    2-Propylaniline                          0.08           2.42    0.295
101  100-83-4     3-Hydroxybenzaldehyde                    0.08           1.38    0.320
102  2227-79-4    Thiobenzamide                            0.09           1.50    0.339
103  350-46-9     1-Fluoro-4-nitrobenzene                  0.10           1.89    0.338
104  18982-54-2   2-Bromobenzyl alcohol                    0.10           1.97    0.309
105  874-90-8     4-Methoxybenzonitrile                    0.10           1.70    0.315
106  108-68-9     3,5-Dimethylphenol                       0.11           2.35    0.300
107  99-61-6      3-Nitrobenzaldehyde                      0.11           1.47    0.332
108  3360-41-6    4-Phenyl-1-butanol                       0.12           2.35    0.285
109  70-70-2      4′-Hydroxypropiophenone                  0.12           2.03    0.318
110  643-28-7     2-iso-Propylaniline                      0.12           2.12    0.294
111  95-65-8      3,4-Dimethylphenol                       0.12           2.23    0.299



TABLE I (continued)

No.  CAS          Name                                    log(IGC50^-1)  log Kow  Amax   Equations*
112  526-75-0     2,3-Dimethylphenol                       0.12           2.42    0.300
113  95-88-5      4-Chlororesorcinol                       0.13           1.80    0.316
114  105-67-9     2,4-Dimethylphenol                       0.14           2.35    0.300
115  156-41-2     2-(4-Chlorophenyl)-ethylamine            0.14           2.00    0.311
116  98-95-3      Nitrobenzene                             0.14           1.85    0.318
117  95-87-4      2,5-Dimethylphenol                       0.14           2.34    0.300
118  2046-18-6    4-Phenylbutyronitrile                    0.15           2.21    0.293
119  873-63-2     3-Chlorobenzyl alcohol                   0.15           1.94    0.314
120  135-02-4     2-Anisaldehyde                           0.15           1.72    0.316  (6)
121  90-00-6      2-Ethylphenol                            0.16           2.47    0.301
122  104-86-9     4-Chlorobenzylamine                      0.16           1.81    0.311
123  705-73-7     (±)-1-Phenyl-2-pentanol                  0.16           2.55    0.284
124  4360-47-8    Cinnamonitrile                           0.16           1.95    0.303
125  552-89-6     2-Nitrobenzaldehyde                      0.17           1.74    0.333
126  100-68-5     Thioanisole                              0.18           2.74    0.296
127  615-65-6     2-Chloro-4-methylaniline                 0.18           2.41    0.303  (4)
128  536-60-7     4-iso-Propylbenzyl alcohol               0.18           2.53    0.284
129  626-19-7     Phenyl-1,3-dialdehyde                    0.18           1.36    0.324
130  367-12-4     2-Fluorophenol                           0.19           1.67    0.309
131  555-16-8     4-Nitrobenzaldehyde                      0.20           1.56    0.333
132  123-07-9     4-Ethylphenol                            0.21           2.50    0.300
133  495-40-9     Butyrophenone                            0.21           2.77    0.318
134  99-88-7      4-iso-Propylaniline                      0.22           2.47    0.293
135  108-42-9     3-Chloroaniline                          0.22           1.88    0.312
136  100-10-7     4-(Dimethylamino)-benzaldehyde           0.23           1.81    0.310
137  5991-31-1    3-Anisaldehyde                           0.23           1.71    0.317
138  1493-27-2    1-Fluoro-2-nitrobenzene                  0.23           1.69    0.343
139  106-42-3     4-Xylene                                 0.25           3.15    0.283
140  108-88-3     Toluene                                  0.25           2.73    0.284  (10)
141  104-93-8     4-Methylanisole                          0.25           2.81    0.299
142  873-76-7     4-Chlorobenzyl alcohol                   0.25           1.96    0.312
143  89-84-9      2,4-Dihydroxyacetophenone                0.25           1.41    0.325
144  88-72-2      2-Nitrotoluene                           0.26           2.30    0.317
145  771-60-8     Pentafluoroaniline                       0.26           1.87    0.338
146  1008-89-5    2-Phenylpyridine                         0.27           2.63    0.299
147  704-13-2     3-Hydroxy-4-nitrobenzaldehyde            0.27           1.47    0.345
148  2416-94-6    2,3,6-Trimethylphenol                    0.28           2.67    0.300



149  620-17-7  3-Ethylphenol  0.29  2.50  0.300
150  579-66-8  2,6-Diethylaniline  0.31  2.87  0.295
151  18358-63-9  Methyl-4-methylaminobenzoate  0.31  2.16  0.316
152  613-90-1  Benzoyl cyanide  0.31  1.91  0.347
153  1875-88-3  4-Chlorophenethyl alcohol  0.32  1.90  0.312
154  121-89-1  3'-Nitroacetophenone  0.32  1.42  0.332
155  1745-81-9  2-Allylphenol  0.33  2.55  0.301
156  42454-06-8  5-Hydroxy-2-nitrobenzaldehyde  0.33  1.75  0.336
157  95-56-7  2-Bromophenol  0.33  2.33  0.314
158  364-74-9  2,5-Difluoronitrobenzene  0.33  1.86  0.349
159  95-69-2  4-Chloro-2-methylaniline  0.35  2.36  0.301
160  615-43-0  2-Iodoaniline  0.35  2.32  0.307
161  697-82-5  2,3,5-Trimethylphenol  0.36  2.92  0.299
162  591-50-4  Iodobenzene  0.36  3.25  0.301  (9)
163  769-92-6  4-(tert)-Butylaniline  0.36  2.70  0.293
164  89-62-3  4-Methyl-2-nitroaniline  0.37  1.82  0.331
165  1199-46-8  2-Amino-4-(tert)-butylphenol  0.37  2.44  0.295
166  101-82-6  2-Benzylpyridine  0.38  2.71  0.296
167  87-60-5  3-Chloro-2-methylaniline  0.38  2.36  0.312
168  95-74-9  3-Chloro-4-methylaniline  0.39  2.41  0.311
169  619-50-1  Methyl-4-nitrobenzoate  0.39  1.94  0.336
170  104-88-1  4-Chlorobenzaldehyde  0.40  2.13  0.325
171  10521-91-2  5-Phenyl-1-pentanol  0.42  2.77  0.285
172  103-63-9  (2-Bromoethyl)-benzene  0.42  3.09  0.294
173  527-60-6  2,4,6-Trimethylphenol  0.42  2.73  0.299
174  99-08-1  3-Nitrotoluene  0.42  2.45  0.317
175  90-02-8  2-Hydroxybenzaldehyde  0.42  1.81  0.318
176  100-00-5  1-Chloro-4-nitrobenzene  0.43  2.39  0.340
177  5292-45-5  Dimethylnitroterephthalate  0.43  1.66  0.340
178  5922-60-1  2-Amino-5-chlorobenzonitrile  0.44  1.79  0.323
179  619-24-9  3-Nitrobenzonitrile  0.45  1.17  0.330  (4)
180  106-38-7  4-Bromotoluene  0.47  3.50  0.306
181  1008-88-4  3-Phenylpyridine  0.47  2.53  0.296
182  99-89-8  4-iso-Propylphenol  0.47  2.90  0.300
183  877-65-6  4-(tert)-Butylbenzyl alcohol  0.48  2.93  0.285
184  91-01-0  Benzhydrol  0.50  2.67  0.289
185  95-79-4  5-Chloro-2-methylaniline  0.50  2.36  0.311



186  554-84-7  3-Nitrophenol  0.51  2.00  0.324
187  95-50-1  1,2-Dichlorobenzene  0.53  3.38  0.319
188  6361-21-3  2-Chloro-5-nitrobenzaldehyde  0.53  2.25  0.357
189  106-48-9  4-Chlorophenol  0.54  2.39  0.308
190  5651-88-7  Phenyl propargyl sulfide  0.54  3.30  0.298
191  615-74-7  2-Chloro-5-methylphenol  0.54  2.65  0.309
192  552-41-0  2-Hydroxy-4-methoxyacetophenone  0.55  1.98  0.324
193  554-00-7  2,4-Dichloroaniline  0.56  2.78  0.311
194  83-41-0  1,2-Dimethyl-3-nitrobenzene  0.56  2.83  0.316
195  1009-14-9  Valerophenone  0.56  3.17  0.318
196  119-33-5  4-Methyl-2-nitrophenol  0.57  2.15  0.333
197  95-82-9  2,5-Dichloroaniline  0.58  2.75  0.318
198  103-26-4  trans-Methyl cinnamate  0.58  2.62  0.319
199  99-51-4  1,2-Dimethyl-4-nitrobenzene  0.59  2.91  0.314
200  7120-43-6  5-Chloro-2-hydroxybenzamide  0.59  2.13  0.323
201  700-38-9  5-Methyl-2-nitrophenol  0.59  2.31  0.333
202  623-12-1  4-Chloroanisole  0.60  2.79  0.307
203  6627-55-0  2-Bromo-4-methylphenol  0.60  2.85  0.312
204  16532-79-9  4-Bromophenyl acetonitrile  0.60  2.43  0.315  (8)
205  4344-55-2  4-Butoxyaniline  0.61  2.59  0.293
206  30273-11-1  4-sec-Butylaniline  0.61  2.87  0.294  (8)
207  618-45-1  3-iso-Propylphenol  0.61  2.90  0.300
208  88-69-7  2-iso-Propylphenol  0.61  2.88  0.301
209  4920-77-8  3-Methyl-2-nitrophenol  0.61  2.29  0.332
210  3011-34-5  4-Hydroxy-3-nitrobenzaldehyde  0.61  1.48  0.352  (10)
211  2973-76-4  5-Bromovanillin  0.62  1.92  0.326
212  402-45-9  α,α,α-Trifluoro-4-cresol  0.62  2.82  0.340
213  2116-65-6  4-Benzylpyridine  0.63  2.62  0.298
214  645-56-7  4-Propylphenol  0.64  3.20  0.300
215  2700-22-3  Benzylidenemalononitrile  0.64  2.15  0.328
216  99-99-0  4-Nitrotoluene  0.65  2.37  0.315
217  626-01-7  3-Iodoaniline  0.65  2.90  0.302
218  2495-37-6  Benzyl methacrylate  0.65  2.53  0.320
219  140-53-4  4-Chlorobenzyl cyanide  0.66  2.47  0.319
220  5428-54-6  2-Methyl-5-nitrophenol  0.66  2.35  0.321
221  601-89-8  2-Nitroresorcinol  0.66  1.56  0.341
222  1585-07-5  1-Bromo-4-ethylbenzene  0.67  4.03  0.306



223  122-03-2  4-iso-Propylbenzaldehyde  0.67  2.92  0.316
224  88-75-5  2-Nitrophenol  0.67  1.77  0.335
225  106-37-6  1,4-Dibromobenzene  0.68  3.79  0.317
226  83-42-1  2-Chloro-6-nitrotoluene  0.68  3.09  0.329
227  88-73-3  1-Chloro-2-nitrobenzene  0.68  2.52  0.343
228  106-41-2  4-Bromophenol  0.68  2.59  0.311
229  1137-41-3  4-Benzoylaniline  0.68  2.46  0.317
230  98-82-8  iso-Propylbenzene  0.69  3.66  0.285
231  1124-04-5  2-Chloro-4,5-dimethylphenol  0.69  3.10  0.309
232  122-94-1  4-Butoxyphenol  0.70  2.90  0.298
233  1570-64-5  4-Chloro-2-methylphenol  0.70  2.78  0.307
234  626-43-7  3,5-Dichloroaniline  0.71  2.90  0.319
235  36436-65-4  2-Hydroxy-4,5-dimethylacetophenone  0.71  2.86  0.316
236  99-77-4  Ethyl-4-nitrobenzoate  0.71  2.33  0.335
237  555-03-3  3-Nitroanisole  0.72  2.17  0.321
238  97-02-9  2,4-Dinitroaniline  0.72  1.72  0.361  (3)
239  121-73-3  1-Chloro-3-nitrobenzene  0.73  2.47  0.332
240  87-65-0  2,6-Dichlorophenol  0.73  2.64  0.320
241  585-34-2  3-tert-Butylphenol  0.74  3.30  0.300
242  29338-49-6  1,1-Diphenyl-2-propanol  0.75  2.93  0.290
243  121-87-9  2-Chloro-4-nitroaniline  0.75  2.05  0.336
244  577-19-5  1-Bromo-2-nitrobenzene  0.75  2.51  0.338  (8)
245  97-54-1  2-Methoxy-4-propenylphenol  0.75  3.31  0.302
246  2973-19-5  2-Chloromethyl-4-nitrophenol  0.75  2.42  0.338
247  78056-39-0  4,5-Difluoro-2-nitroaniline  0.75  2.19  0.348
248  24544-04-5  2,6-Diisopropylaniline  0.76  3.18  0.294
249  65262-96-6  3-Chloro-5-methoxyphenol  0.76  2.50  0.322
250  616-86-4  4-Ethoxy-2-nitroaniline  0.76  2.39  0.326  (3)
251  99-65-0  1,3-Dinitrobenzene  0.76  1.49  0.345
252  2357-47-3  α,α,α-4-Tetrafluoro-3-toluidine  0.77  2.51  0.341
253  94-30-4  Ethyl-4-methoxybenzoate  0.77  2.81  0.319
254  5342-87-0  (±)-1,2-Diphenyl-2-propanol  0.80  3.23  0.290
255  59-50-7  4-Chloro-3-methylphenol  0.80  3.10  0.307
256  350-30-1  3-Chloro-4-fluoronitrobenzene  0.80  2.74  0.347  (11)
257  2905-69-3  Methyl-2,5-dichlorobenzoate  0.81  3.16  0.332
258  89-59-8  4-Chloro-2-nitrotoluene  0.82  3.05  0.328
259  653-37-2  Pentafluorobenzaldehyde  0.82  2.39  0.357  (8)



260  14548-45-9  4-Bromophenyl-3-pyridyl ketone  0.82  2.96  0.331
261  42087-80-9  Methyl-4-chloro-2-nitrobenzoate  0.82  2.41  0.342
262  100-29-8  4-Nitrophenetole  0.83  2.53  0.328
263  573-56-8  2,6-Dinitrophenol  0.83  1.33  0.372
264  606-22-4  2,6-Dinitroaniline  0.84  1.79  0.366
265  540-38-5  4-Iodophenol  0.85  2.90  0.311
266  603-71-4  1,3,5-Trimethyl-2-nitrobenzene  0.86  3.22  0.313  (5)
267  2430-16-2  6-Phenyl-1-hexanol  0.87  3.30  0.285  (5)
268  108-43-0  3-Chlorophenol  0.87  2.50  0.317
269  119-61-9  Benzophenone  0.87  3.18  0.321
270  108-70-3  1,3,5-Trichlorobenzene  0.87  4.19  0.325
271  121-14-2  2,4-Dinitrotoluene  0.87  1.98  0.345  (5)
272  98-54-4  4-(tert)-Butylphenol  0.91  3.31  0.300
273  3597-91-9  4-Biphenylmethanol  0.92  2.99  0.287
274  527-54-8  3,4,5-Trimethylphenol  0.93  2.87  0.298
275  131-55-5  2,2',4,4'-Tetrahydroxybenzophenone  0.96  2.92  0.333
276  39905-50-5  4-Pentyloxyaniline  0.97  3.12  0.293
277  611-06-3  2,4-Dichloronitrobenzene  0.99  3.09  0.350
278  103-36-6  (trans)-Ethyl cinnamate  0.99  2.99  0.318
279  1137-42-4  4-Benzoylphenol  1.02  3.07  0.321
280  1137-42-4  4-Benzoylphenol  1.02  3.07  0.321
281  585-79-5  1-Bromo-3-nitrobenzene  1.03  2.64  0.328
282  120-83-2  2,4-Dichlorophenol  1.04  3.17  0.318
283  329-71-5  2,5-Dinitrophenol  1.04  1.86  0.361
284  874-42-0  2,4-Dichlorobenzaldehyde  1.04  3.08  0.335
285  92-52-4  Biphenyl  1.05  3.98  0.288  (8)
286  51-28-5  2,4-Dinitrophenol  1.06  1.54  0.368
287  104-13-2  4-Butylaniline  1.07  3.18  0.294
288  95-75-0  3,4-Dichlorotoluene  1.07  3.95  0.318
289  3209-22-1  2,3-Dichloronitrobenzene  1.07  3.05  0.350
290  2491-32-9  Benzyl-4-hydroxyphenyl ketone  1.07  3.22  0.321
291  120-82-1  1,2,4-Trichlorobenzene  1.08  4.02  0.326
292  14143-32-9  4-Chloro-3-ethylphenol  1.08  3.51  0.308
293  3819-88-3  1-Fluoro-3-iodo-5-nitrobenzene  1.09  3.15  0.335
294  136-36-7  Resorcinol monobenzoate  1.11  3.13  0.330
295  3531-19-9  6-Chloro-2,4-dinitroaniline  1.12  2.46  0.370  (6)
296  3218-36-8  4-Biphenylcarboxaldehyde  1.12  3.38  0.317



297  618-62-2  3,5-Dichloronitrobenzene  1.13  3.09  0.339
298  89-61-2  2,5-Dichloronitrobenzene  1.13  3.03  0.349
299  7149-70-4  2-Bromo-5-nitrotoluene  1.16  3.25  0.334
300  99-54-7  3,4-Dichloronitrobenzene  1.16  3.12  0.348
301  1879-09-0  6-tert-Butyl-2,4-dimethylphenol  1.16  4.30  0.300
302  2374-05-2  4-Bromo-2,6-dimethylphenol  1.16  3.63  0.309
303  835-11-0  2,2'-Dihydroxybenzophenone  1.16  3.47  0.336
304  1689-84-5  3,5-Dibromo-4-hydroxybenzonitrile  1.16  2.88  0.341
305  5736-91-4  4-(Pentyloxy)-benzaldehyde  1.18  3.89  0.315  (9)
306  100-14-1  4-Nitrobenzyl chloride  1.18  2.45  0.323
307  942-92-7  Hexanophenone  1.19  3.70  0.318
308  88-04-0  4-Chloro-3,5-dimethylphenol  1.20  3.48  0.306
309  80-46-6  4-tert-Pentylphenol  1.23  3.83  0.300  (3)
310  7778-83-8  n-Propyl cinnamate  1.23  3.52  0.318
311  1817-73-8  2-Bromo-4,6-dinitroaniline  1.24  2.61  0.372
312  104-51-8  n-Butylbenzene  1.25  4.26  0.284
313  528-29-0  1,2-Dinitrobenzene  1.25  1.69  0.351
314  90-90-4  4-Bromobenzophenone  1.26  4.12  0.326
315  2683-43-4  2,4-Dichloro-6-nitroaniline  1.26  3.33  0.349
316  67-36-7  4-Phenoxybenzaldehyde  1.26  3.96  0.317
317  610-78-6  4-Chloro-3-nitrophenol  1.27  2.46  0.339
318  7530-27-0  4-Bromo-6-chloro-2-cresol  1.28  3.61  0.319
319  636-30-6  2,4,5-Trichloroaniline  1.30  3.69  0.325
320  100-25-4  1,4-Dinitrobenzene  1.30  1.47  0.347
321  86-00-0  2-Nitrobiphenyl  1.30  3.77  0.322
322  500-66-3  5-Pentylresorcinol  1.31  3.42  0.305
323  5798-75-4  Ethyl-4-bromobenzoate  1.33  3.50  0.325
324  13608-87-2  2',3',4'-Trichloroacetophenone  1.34  3.21  0.336  (6)
325  93-99-2  Phenyl benzoate  1.35  3.59  0.327
326  17696-62-7  Phenyl-4-hydroxybenzoate  1.37  3.49  0.327
327  3460-18-2  2,5-Dibromonitrobenzene  1.37  3.41  0.346
328  39905-57-2  4-Hexyloxyaniline  1.38  3.65  0.293
329  615-58-7  2,4-Dibromophenol  1.40  3.25  0.323  (10)
330  88-06-2  2,4,6-Trichlorophenol  1.41  3.69  0.326
331  103-72-0  Phenyl isothiocyanate  1.41  3.28  0.352  (4)
332  131-57-7  2-Hydroxy-4-methoxybenzophenone  1.42  3.58  0.327
333  18708-70-8  1,3,5-Trichloro-2-nitrobenzene  1.43  3.69  0.354



334  120-51-4  Benzyl benzoate  1.45  3.97  0.321
335  6521-30-8  iso-Amyl-4-hydroxybenzoate  1.48  3.97  0.320
336  844-51-9  2,5-Diphenyl-1,4-benzoquinone  1.48  3.16  0.332
337  134-85-0  4-Chlorobenzophenone  1.50  3.97  0.325  (4)
338  17700-09-3  1,2,3-Trichloro-4-nitrobenzene  1.51  3.61  0.357
339  89-69-0  1,2,4-Trichloro-5-nitrobenzene  1.53  3.47  0.354
340  538-65-8  n-Butyl cinnamate  1.53  4.05  0.318
341  1016-78-0  3-Chlorobenzophenone  1.55  3.97  0.325
342  90-60-8  3,5-Dichlorosalicylaldehyde  1.55  3.07  0.334
343  1671-75-6  Heptanophenone  1.56  4.23  0.318
344  591-35-5  3,5-Dichlorophenol  1.56  3.61  0.325
345  620-88-2  4-Nitrophenyl phenyl ether  1.58  3.83  0.330
346  827-23-6  2,4-Dibromo-6-nitroaniline  1.62  3.63  0.352
347  7147-89-9  4-Chloro-6-nitro-3-cresol  1.63  2.93  0.343
348  771-61-9  Pentafluorophenol  1.63  3.23  0.345
349  1138-52-9  3,5-Di-tert-butylphenol  1.64  5.13  0.298  (9)
350  90-59-5  3,5-Dibromosalicylaldehyde  1.65  3.42  0.338
351  88-30-2  3-Trifluoromethyl-4-nitrophenol  1.65  2.77  0.352
352  6641-64-1  4,5-Dichloro-2-nitroaniline  1.66  3.21  0.345
353  70-34-8  2,4-Dinitro-1-fluorobenzene  1.71  1.47  0.375  (7)
354  69212-31-3  2-(Benzylthio)-3-nitropyridine  1.72  3.42  0.335
355  534-52-1  4,6-Dinitro-2-methylphenol  1.73  2.12  0.366
356  609-89-2  2,4-Chloro-6-nitrophenol  1.75  3.07  0.354
357  3481-20-7  2,3,5,6-Tetrachloroaniline  1.76  4.10  0.330
358  3217-15-0  4-Bromo-2,6-dichlorophenol  1.78  3.52  0.329
359  879-39-0  2,3,4,5-Tetrachloronitrobenzene  1.78  3.93  0.361
360  538-68-1  n-Amylbenzene  1.79  4.90  0.284  (4)
361  136-77-6  4-Hexylresorcinol  1.80  3.45  0.306
362  4097-49-8  4-(tert)-Butyl-2,6-dinitrophenol  1.80  3.61  0.367  (5)
363  305-85-1  2,6-Iodo-4-nitrophenol  1.81  3.52  0.353
364  117-18-0  2,3,5,6-Tetrachloronitrobenzene  1.82  4.38  0.360  (6)
365  314-41-0  2,3,4,6-Tetrafluoronitrobenzene  1.87  1.86  0.372
366  1674-37-9  Octanophenone  1.89  4.75  0.318  (6)
367  771-69-7  1,2,3-Trifluoro-4-nitrobenzene  1.89  2.01  0.362
368  118-79-6  2,4,6-Bromophenol  1.91  4.08  0.334
369  634-83-3  2,3,4,5-Tetrachloroaniline  1.96  4.27  0.333



370  5707-44-8  4-Ethylbiphenyl  1.97  5.06  0.288
371  95-94-3  1,2,4,5-Tetrachlorobenzene  2.00  4.63  0.331  (9)
372  87-86-5  Pentachlorophenol  2.07  5.18  0.343
373  95-95-4  2,4,5-Trichlorophenol  2.10  3.72  0.330
374  709-49-9  2,4-Dinitro-1-iodobenzene  2.12  2.50  0.359
375  97-00-7  1-Chloro-2,4-dinitrobenzene  2.16  2.14  0.374
376  58-90-2  2,3,4,6-Tetrachlorophenol  2.18  3.88  0.337
377  6284-83-9  1,3,5-Trichloro-2,4-dinitrobenzene hemihydrate  2.19  2.97  0.385
378  6306-39-4  1,2-Dichloro-4,5-dinitrobenzene  2.21  2.93  0.365  (11)
379  28689-08-9  1,5-Dichloro-2,3-dinitrobenzene  2.42  2.85  0.369
380  104-40-5  Nonylphenol  2.47  5.76  0.300  (3)
381  576-55-6  3,4,5,6-Tetrabromo-2-cresol  2.57  4.97  0.336
382  2678-21-9  1,3-Dinitro-2,4,5-trichlorobenzene  2.60  3.05  0.385  (3)
383  608-71-9  Pentabromophenol  2.66  4.85  0.346  (3)
384  4901-51-3  2,3,4,5-Tetrachlorophenol  2.72  4.21  0.339  (7)
385  20098-38-8  1,4-Dinitrotetrachlorobenzene  2.82  3.44  0.380

* The notation in this column indicates which compounds entered the relevant training set.



reported in Table IV. In each case the slope is close to one and the intercept near zero,

indicating a near-ideal relationship.
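
The two regressions summarized in Table IV (with an intercept, and forced through the origin) can be illustrated with a minimal Python sketch. The function names and the five data points below are hypothetical and are used only for illustration; they are not values from this study.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_through_origin(x, y):
    """Least squares for y = slope * x, with the intercept forced to zero."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical calculated and observed toxicities (illustration only).
tox_calc = [0.2, 0.6, 1.0, 1.5, 2.1]
tox_obs = [0.25, 0.55, 1.05, 1.45, 2.15]

slope, intercept = fit_line(tox_calc, tox_obs)
slope_origin = fit_through_origin(tox_calc, tox_obs)
print(round(slope, 3), round(intercept, 3), round(slope_origin, 3))
```

A slope near one together with an intercept near zero in the first fit, and a slope near one in the second, would correspond to the near-ideal behaviour described above.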

The luxury of having a toxicity database of 385 compounds affords the opportunity to

perform comparisons not normally possible. The relationships between observed and

predicted toxicity for those compounds not within the pre-defined test sets (in other words all

the remaining compounds in the data set which were not part of the test and training sets) are

reported in Table V. These results revealed that in all cases the QSAR was a good predictor of

toxic potency of the remaining derivatives.

DISCUSSION

The determination of the quality of a QSAR is frequently a daunting task. This is due to the

fact that structure–toxicity relationships are estimations of intricate processes, which, for the

most part, are not known in detail [24]. It is obvious that definition of the quality of a QSAR

FIGURE 2 Plot of log Kow vs. Amax. The values of both descriptors are standardized in the range (0, 1). Ntrain = 15 and Ntest = 45.

FIGURE 1 Plot of observed toxicity vs. toxicity predicted by Eq. (2).



goes further than the determination of a highly significant statistical fit. While a high quality

QSAR can only be constructed and validated with high quality biological and physico-

chemical data [8], a high quality QSAR must be applicable to further compounds of interest.

For example, a high quality QSAR can be developed and validated for a congeneric series of

aliphatic alcohols (see Refs. [25–28]). Such hydrophobic-dependent QSARs are transparent

and mechanistically interpretable. However, because of the narrow molecular domain on which

they are founded, such QSARs are of limited applicability [29]. Thus, the correct formal

selection of training and validating data is at the heart of the issues of QSAR quality and

applicability.

It is our collective opinion that training sets in QSAR development should be selected so as

to make best use of diversity and thus optimize applicability. The goal in this study was to

investigate the role of the size of the relative data sets required to achieve maximal coverage

of the descriptor space, with as few chemicals as possible. To enable this aim and to allow

applicability of the QSAR, diversity-based selection of the training set ensured that

validation will be interpolation within the physico-chemical domain and not extrapolation

from outside the domain (see Fig. 2 as an example).

Dissimilarity-based procedures such as the DOE methodology applied in this study have

been utilized successfully for the selection of informative training sets in the QSAR analysis

of both pharmacological [30,31] and toxicological [32,33] endpoints. In conjunction with

(fractional) factorial and D-optimal design, they have been used to maximize the volume of

the descriptor space covered by the training set [14]. In compound selection, these

TABLE II Diversity/representativity criteria for the training set (selected by DOE) and the test set (selected by KNN)

Training set                      Test set
Ntrain  Dave   Dmax   Gopt       Ntest  Dave   Dmax   Gopt       Dc-c
10      0.398  0.600  0.663      30     0.275  0.582  0.473      0.133
15      0.365  0.607  0.601      45     0.304  0.602  0.505      0.154
20      0.354  0.613  0.577      60     0.283  0.580  0.488      0.167
25      0.345  0.609  0.567      75     0.285  0.592  0.481      0.140
30      0.351  0.584  0.601      90     0.290  0.590  0.492      0.104
35      0.336  0.584  0.575      105    0.286  0.589  0.486      0.120
40      0.328  0.598  0.548      120    0.273  0.588  0.464      0.152
45      0.324  0.603  0.537      135    0.273  0.606  0.450      0.152
50      0.322  0.606  0.531      150    0.267  0.614  0.435      0.152

The numbers were derived from standardized (0, 1) variables (Dave: average distance to the centroid of the set; Dmax: maximum distance to the centroid of the set; Gopt = Dave/Dmax; Dc-c: distance between the centroids of the training and test sets).

TABLE III Coefficients (standard error in parentheses) and statistics of the model TOX = a × log P + b × Amax - c

Ntrain  a              b             c               s      r2(adj.)  r2cv   F    Equation
10      0.604 (0.040)  17.15 (2.45)  -6.196 (0.790)  0.244  0.970     0.938  149  (3)
15      0.591 (0.038)  16.23 (2.28)  -5.900 (0.746)  0.258  0.954     0.927  147  (4)
20      0.590 (0.034)  15.30 (1.84)  -5.628 (0.596)  0.248  0.952     0.932  189  (5)
25      0.574 (0.033)  13.62 (1.75)  -5.091 (0.571)  0.259  0.940     0.922  189  (6)
30      0.569 (0.036)  15.72 (1.95)  -5.704 (0.631)  0.313  0.922     0.906  172  (7)
35      0.571 (0.034)  15.07 (1.75)  -5.519 (0.571)  0.302  0.915     0.900  184  (8)
40      0.561 (0.032)  15.53 (1.73)  -5.682 (0.569)  0.306  0.907     0.893  191  (9)
45      0.554 (0.030)  15.24 (1.59)  -5.557 (0.520)  0.295  0.909     0.897  221  (10)
50      0.557 (0.030)  15.82 (1.52)  -5.751 (0.492)  0.300  0.905     0.894  236  (11)

Ntrain is obtained by the DOE procedure.



procedures involve the identification of a subset comprising the n most dissimilar molecules

in a database containing N molecules, typically where n ≪ N [21]. A further technique for

compound selection was described in the recent paper of Golbraikh and Tropsha [34] who

examined the use of sphere-exclusion algorithms for the rational selection of training and test

sets used in the development of QSAR models. One drawback of sphere exclusion

algorithms, that it is not possible to specify the size of the subset, is overcome in the

maximum-dissimilarity algorithms, one of which (distance-based optimality) was used in

this study.
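
To make the idea of maximum-dissimilarity selection concrete, the following Python sketch picks a fixed number of mutually distant points from a standardized descriptor space. It uses a simple greedy maximin rule as a stand-in; it is not the distance-based optimality procedure used in this study, and the function name `maxmin_select` and the example points are hypothetical.

```python
import math


def maxmin_select(points, n_select):
    """Greedy maximin choice of n_select mutually dissimilar points (returns indices)."""
    dim = len(points[0])
    centroid = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    # Seed with the point farthest from the centroid of the whole set.
    first = max(range(len(points)), key=lambda i: math.dist(points[i], centroid))
    chosen = [first]
    while len(chosen) < n_select:
        remaining = [i for i in range(len(points)) if i not in chosen]
        # Add the candidate whose nearest already-chosen point is farthest away.
        best = max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


# Hypothetical standardized (log Kow, Amax) pairs, for illustration only.
descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.5, 0.4)]
print(maxmin_select(descriptors, 3))
```

Unlike sphere exclusion, this kind of procedure takes the subset size as an input, which is the property emphasized in the paragraph above.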

While selection of the training set of chemicals for a QSAR should be based on diversity,

we feel that selection of the chemicals for validation should be based on representativity.

Representativity methods, such as KNN, ensure that the validation chemicals mimic the

distribution of all the chemicals within the descriptor space and thus reflect, and allow, only

interpolation within the training space (see Fig. 2 as an example).

There is no common opinion regarding the optimal distribution of chemicals between the

training and validation subsets. Brown and Martin [15] suggested that n should be

approximately 0.2 N if a subset is to be fully representative of its parent database, while

Matter [16] suggests a value of approximately 0.35 N. For the purposes of this investigation

n was selected to be 0.25 N. Selecting the training set to be 25% of the combined

training and test sets is within the 20–35% range reported in the literature and allows for

easy determination of the number of compounds in the validation subset.

KNN was utilized to obtain a representative test set. One drawback of a clustering

technique such as KNN is that it is not possible to control cluster size. Thus, in selecting

compounds for toxicity testing for validation of a QSAR, “empty” or “exhausted” (i.e. ones

with insufficient compounds) clusters may become apparent. To illustrate this point with an

example, with a 25-cluster scheme based on the data presented in this study, 8 clusters with

more than 20 derivatives were obtained, but also 6 clusters with fewer than 4 derivatives. The

latter 6 clusters clearly pose a potential problem in the selection of validation chemicals. As

the number of clusters increased so did the number of clusters with relatively few

derivatives. As a practical approach to circumvent the problem of “empty” clusters or those

“exhausted” of compounds, in some cases validation derivatives were selected from the cluster

nearest to the “empty” or “exhausted” one. To achieve this, the distances between the

cluster centroids were used to reveal the nearest cluster. If the nearest cluster was also

“empty” or exhausted of derivatives the procedure was repeated to find the nearest cluster

containing derivatives.
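
The fallback described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the cluster labels, centroids and compound names are hypothetical, and the function `pick_from_cluster` is not part of any software used in this study.

```python
import math


def pick_from_cluster(cluster_id, centroids, available):
    """Pick one compound from cluster_id, or from the nearest non-empty cluster.

    centroids: dict mapping cluster label -> centroid coordinates
    available: dict mapping cluster label -> list of compounds not yet used
    Returns (cluster actually used, compound picked).
    """
    # Rank clusters by centroid distance to the requested cluster;
    # the requested cluster itself comes first, at distance zero.
    order = sorted(
        centroids,
        key=lambda c: math.dist(centroids[c], centroids[cluster_id]),
    )
    for candidate in order:
        if available[candidate]:
            return candidate, available[candidate].pop(0)
    raise ValueError("no compounds left in any cluster")


# Hypothetical clusters: A is exhausted, B is its nearest neighbour.
centroids = {"A": (0.0, 0.0), "B": (0.2, 0.0), "C": (1.0, 1.0)}
available = {"A": [], "B": ["compound-1"], "C": ["compound-2"]}
print(pick_from_cluster("A", centroids, available))
```

If the nearest cluster is itself exhausted, the loop simply moves on to the next-nearest one, mirroring the repeated procedure described in the text.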

The basis of this study was the understanding that the training set should be

maximally diverse and the test set should be representative of the training set in terms of

TABLE IV Statistics of the relationship between observed and calculated toxicity, with an intercept and through the origin, for the test set selected by KNN

       TOX(obs) = a × TOX(calc) + b      TOX(obs) = a × TOX(calc)
Ntest  r2     Slope  Intercept            r2*    Slope*
30     0.758  0.977  -0.128               0.838  0.902
45     0.819  0.959  -0.057               0.890  0.928
60     0.797  0.973  -0.051               0.876  0.942
75     0.811  1.059  -0.063               0.893  1.019
90     0.843  1.042  -0.078               0.906  0.995
105    0.821  1.042  -0.079               0.898  0.991
120    0.815  1.016  -0.031               0.904  0.996
135    0.821  1.031  -0.058               0.899  0.992
150    0.834  0.999  -0.047               0.904  0.969



the physico-chemical descriptors (i.e. without reference to toxicity). For quantitative

estimation of how well this goal was met, two criteria—Gopt and Dc-c, as defined in Table II,

were used. Ideally, the training and test set should have a common centroid (Dc-c = 0) and

equivalent Gopt values. However, according to one of the main rules for development of

predictive QSARs, namely that the predictions should be interpolated and not extrapolated

out of the chemical domain of the model [8], it is acceptable for the Gopt (test) to be slightly

lower than the Gopt (train). Since practical application will seldom give the ideal, theoretical

result, it may be assumed that the “best” test set is that which gives the lowest Dc-c distance

and has a Gopt value closest to that of the training set. Quantitative criteria such as Gopt and

Dc-c to assess training and test sets were employed as the KNN method allows the choice of

different chemicals for the validation set. This is especially the case and useful when large

numbers of clusters, many with few compounds, were analyzed (i.e. the case of the 8 clusters

containing more than 20 compounds in the 25-cluster scheme referred to above).

The diversity and representativity criteria for training and test sets are shown in Table II.

For the smallest number of compounds used, Ntrain = 10 and Ntest = 30, the distance

between the centroids of the training and test set is relatively low. However, these two sets are

characterized by the greatest difference (i.e. about 0.2) between Gopt (train) and Gopt (test).

All other selections of chemicals have a lower difference (Gopt (train) - Gopt (test))

(i.e. about 0.1) and slightly greater Dc-c values. Further examination of these data leads to the

conclusion that for training set sizes of 15 or more (i.e. Ntrain = 15, Ntest = 45) (see Fig. 2)

the training and test sets selected are approximately equivalent in terms of representativity.

In all cases Gopt (test) is lower than Gopt (train), which indicates that the predictions for the

derivatives in the validation subset will not be the result of extrapolation to descriptor space

outside of that covered by the training set. Thus, the information in Table II and Fig. 2

indicates that, despite not being ideal, the methodology used resulted in acceptable selection

of diverse training, and representative, testing sets. Moreover, it shows that the Gopt and Dc-c

criteria can be used successfully for assessment of diversity/representivity in QSAR

development.

Modeling of toxicity with the DOE-selected training sets of varying sizes resulted in QSARs that are similar in terms of coefficients and statistical criteria (Table III). Only two models, N train = 10 and N train = 25, differ slightly from the others in terms of the regression coefficient on Amax. As the size of the training set increases, the coefficients for log Kow and Amax decrease slightly, reflecting more closely the QSAR for the complete data set described in Eq. (2). All the models show excellent statistical fit, although it should be noted that r² remains unrealistically high for most of the models given the likely biological error in the data. There is a systematic decrease in the r² of the statistical fit as training set size increases. However, for assessing the robustness of the models, the predictivity of the QSAR, i.e. its ability to predict accurately the toxicity of the compounds in the test set, is of greater importance.

To assess how well the models listed in Table III predict the toxicity of the chemicals in the test set, two regression procedures were applied [35]. The first (regression with intercept) provides the actual relationship, in terms of a regression equation, between observed and predicted toxicity. However, the simultaneous variation in the slope and the intercept makes comparison between models difficult, so a second procedure, regression through the origin (i.e. no intercept), was also applied. Although the coefficient of determination for the fit through the origin (r²*) cannot be compared directly with that for the fit with intercept (r²), the former is more convenient for comparison between the models, since it accounts for the spread of the toxic potency around the ideal line (i.e. the line with a slope of 1 and an intercept of 0). In this case the term Slope* accounts for the deviation of the real regression line from the ideal one.
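The two regression procedures can be sketched as follows. Observed toxicity is regressed on calculated toxicity, first with an intercept and then through the origin. The sketch assumes that r²* is the uncentered coefficient of determination that most statistical packages report for a no-intercept fit (1 − SSE / Σ obs²); this is one reason it cannot be compared directly with r², but the definition actually used in this study may differ. The function names and the example values are hypothetical.

```python
def fit_with_intercept(calc, obs):
    """Least-squares fit of obs = a * calc + b; returns (r2, slope, intercept)."""
    n = len(calc)
    mean_x = sum(calc) / n
    mean_y = sum(obs) / n
    sxx = sum((x - mean_x) ** 2 for x in calc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calc, obs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(calc, obs))
    sst = sum((y - mean_y) ** 2 for y in obs)  # centered total sum of squares
    return 1.0 - sse / sst, slope, intercept


def fit_through_origin(calc, obs):
    """Least-squares fit of obs = a * calc; returns (r2_star, slope_star)."""
    slope = sum(x * y for x, y in zip(calc, obs)) / sum(x * x for x in calc)
    sse = sum((y - slope * x) ** 2 for x, y in zip(calc, obs))
    sst0 = sum(y * y for y in obs)  # uncentered total (assumed definition)
    return 1.0 - sse / sst0, slope


# Hypothetical calculated and observed toxicities; not data from this study.
tox_calc = [0.2, 0.6, 1.0, 1.4, 1.8]
tox_obs = [0.1, 0.7, 0.9, 1.3, 1.9]

r2, slope, intercept = fit_with_intercept(tox_calc, tox_obs)
r2_star, slope_star = fit_through_origin(tox_calc, tox_obs)
print(f"with intercept: r2={r2:.3f} slope={slope:.3f} intercept={intercept:.3f}")
print(f"through origin: r2*={r2_star:.3f} slope*={slope_star:.3f}")
```

With a single free parameter, Slope* summarizes the departure from the ideal line in one number, which is what makes it convenient for comparing models side by side.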



For the prediction of the toxicity of the test sets selected by the KNN method, the statistical fit in terms of r² values is quite acceptable for predicting toxic potency (Table IV). Increasing the size of the training set beyond N train = 10 appears to provide no significant improvement in the prediction of toxicity.

Model predictivity was tested further on the compounds that were not selected for either the training or the test set (Table V). It should be stressed that this check of predictivity will not be available when compiling new data (i.e. if this analysis is performed de novo). Comparison of the results in Table V with those in Table IV indicates that the toxicity of the compounds not included in the training and test sets is better predicted than that of the compounds selected by KNN for the test set (higher r² and r²*). Comparison of the slopes in the regression through the origin shows that, in all the models, Slope* is lower for the remaining compounds, which indicates that the toxicity of those compounds will be over-predicted. However, this may be the more desirable situation, especially for risk assessment.
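The link between a Slope* below 1 and over-prediction can be checked with simple arithmetic. The sketch below takes the Slope* values of Table V as transcribed from the flattened table and applies TOX(obs) = Slope* × TOX(calc) to a hypothetical calculated toxicity of 1.0. The conclusion holds for positive calculated values only; the helper name is an assumption made for illustration.

```python
# Slope* values from Table V (regression through the origin), as transcribed.
SLOPE_STAR_TABLE_V = [0.836, 0.859, 0.883, 0.913, 0.832, 0.859, 0.891, 0.876, 0.880]


def expected_observed(tox_calc, slope_star):
    """Observed toxicity implied by the through-origin fit obs = slope* x calc."""
    return slope_star * tox_calc


tox_calc = 1.0  # hypothetical calculated toxicity, for illustration only
expected = [expected_observed(tox_calc, s) for s in SLOPE_STAR_TABLE_V]

# Every implied observed value is below the calculated one, i.e. the model
# predicts a higher toxicity than is observed for these compounds.
over_predicted = all(value < tox_calc for value in expected)
print(over_predicted, min(expected), max(expected))
```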

It was evident from the model parameters and validation statistics that there is no significant improvement (in either statistical fit or predictivity) among the models based on 15–50 derivatives as selected by DOE. However, with fewer than 15 compounds in the training set, the noise, or error, in both the toxicological data and the chemical descriptors has a greater influence on the quality of the model.

TABLE V Statistics of the relationship between observed and calculated toxicity, with intercept and through the origin, for the derivatives that do not participate in either the training or test sets

         TOX(obs) = a × TOX(calc) + b      TOX(obs) = a × TOX(calc)
Ntest    r²       Slope    Intercept       r²*      Slope*
345      0.866    0.899    -0.088          0.918    0.836
325      0.854    0.929    -0.091          0.910    0.859
305      0.861    0.950    -0.084          0.914    0.883
285      0.847    0.983    -0.080          0.905    0.913
265      0.851    0.906    -0.083          0.912    0.832
245      0.866    0.934    -0.084          0.916    0.859
225      0.860    0.933    -0.044          0.912    0.891
205      0.868    0.950    -0.076          0.916    0.876
185      0.848    0.944    -0.064          0.909    0.880

In conclusion, training sets for QSAR development should be selected to make the best use of diversity, for example by the DOE methodology, in order to optimize applicability. Moreover, diversity-based selection of the training set ensures that validation will be a process of interpolation. It is desirable that the selection of validation chemicals is based on the representativity of the training set, since methods such as KNN ensure that the validation chemicals mimic the distribution of the training chemicals within the descriptor space. Moreover, a training-to-test ratio of 1:3 appears adequate (and indeed rigorous) for validation. This study reveals that, with appropriate selection of chemicals, it is possible to maximize the coverage of the descriptor space with a relatively small number of chemicals to train and validate a high-quality QSAR. In this study specifically, a chemical universe of 385 compounds was represented by 60 chemicals (15 for training and 45 for validation). This approach thus reduces the time and resources required for testing without reducing the quality of the QSAR. Although the methodology described in this investigation was based upon a two-descriptor regression model, we believe that it is also applicable to QSAR analyses with more descriptors. From the findings in this paper, as a conservative recommendation (and building upon the original rule of thumb of Topliss and Costello [10]), it is suggested that there should be a minimum of 10 observations for each variable in a QSAR. Further investigations are required to extend this finding to other endpoints (both toxicological and pharmacological) and to other modeling techniques such as PLS and neural networks.
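The general idea of a diverse training set paired with a representative test set can be sketched in a few lines. This is a simplified stand-in and not the DOE or KNN procedure used in this study: training compounds are picked by a greedy max-min (dissimilarity-based) rule, and the test set is formed from the k nearest unused neighbours of each training compound, with k = 3 chosen only so that the 1:3 training-to-test ratio is reproduced. All function names and the randomly generated descriptor values are hypothetical.

```python
import math
import random


def maxmin_select(points, n_train):
    """Greedy max-min selection: each pick is as far as possible from prior picks."""
    center = [sum(column) / len(points) for column in zip(*points)]
    # Start with the compound farthest from the overall centroid.
    chosen = [max(range(len(points)), key=lambda i: math.dist(points[i], center))]
    while len(chosen) < n_train:
        remaining = [i for i in range(len(points)) if i not in chosen]
        # Take the candidate whose nearest already-chosen compound is farthest away.
        chosen.append(max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[j]) for j in chosen),
        ))
    return chosen


def nearest_neighbour_test_set(points, train_idx, k=3):
    """For each training compound, take its k nearest compounds not yet used."""
    used = set(train_idx)
    test_idx = []
    for t in train_idx:
        candidates = sorted(
            (i for i in range(len(points)) if i not in used),
            key=lambda i: math.dist(points[i], points[t]),
        )
        for i in candidates[:k]:
            used.add(i)
            test_idx.append(i)
    return test_idx


# Hypothetical two-descriptor data set (e.g. scaled log Kow and Amax).
rng = random.Random(0)
compounds = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(60)]

train_idx = maxmin_select(compounds, n_train=5)
test_idx = nearest_neighbour_test_set(compounds, train_idx, k=3)
print(len(train_idx), "training compounds,", len(test_idx), "test compounds")
```

Because every test compound is a near neighbour of some training compound, predictions for the test set are interpolations within the region spanned by the training set, which is the property emphasized in the conclusions above.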

Acknowledgements

Toxicity data acquisitions were supported in part by a grant from The University of

Tennessee Center of Excellence in Livestock Disease and Human Health. The European

Union IMAGETOX Research Training Network (HPRN-CT-1999-00015) supported

Dr Netzeva. Gratitude is expressed to Mr Glendon Sinks for his assistance with toxicity

analyses.

References

[1] Worth, A.P. and Balls, M. (2002) “Alternative (non-animal) methods for chemical testing: current status and future aspects”, ATLA 30(suppl. 1), 1–125.

[2] Walker, J.D. and Schultz, T.W. (2002) “Structure activity relationships for predicting ecological effects of chemicals”, In: Hoffman, D.J., Rattner, B.A., Burton, Jr., G.A. and Cairns, Jr., J., eds, Handbook of Ecotoxicology 2nd Ed. (CRC Press, Boca Raton), (in press).

[3] Schultz, T.W., Cronin, M.T.D., Walker, J.D. and Aptula, A.O. (2003) “Quantitative structure–activity relationships (QSARs) in toxicology: a historical perspective”, J. Mol. Struct.—Theochem., (in press).

[4] Karabunarliev, S., Mekenyan, O.G., Karcher, W., Russom, C.L. and Bradbury, S.P. (1996) “Quantum-chemical descriptors for estimating the acute toxicity of substituted benzenes to the guppy (Poecilia reticulata) and fathead minnow (Pimephales promelas)”, Quant. Struct.–Act. Relat. 15, 311–320.

[5] Schultz, T.W. (1999) “Structure–toxicity relationships for benzenes evaluated with Tetrahymena pyriformis”, Chem. Res. Toxicol. 12, 1262–1267.

[6] Cronin, M.T.D. and Schultz, T.W. (2001) “Development of quantitative structure–activity relationships for the toxicity of aromatic compounds to Tetrahymena pyriformis: Comparative assessment of methodologies”, Chem. Res. Toxicol. 14, 1284–1295.

[7] Seward, J.R., Cronin, M.T.D. and Schultz, T.W. (2001) “Structure–toxicity analyses of Tetrahymena pyriformis exposed to pyridine—an examination into extension of surface-response domains”, SAR QSAR Environ. Res. 11, 489–512.

[8] Schultz, T.W. and Cronin, M.T.D. (2003) “Essential and desirable characteristics of ecotoxicity QSARs”, Environ. Toxicol. Chem., (in press).

[9] Cronin, M.T.D., Aptula, A.O., Duffy, J.C., Netzeva, T.I., Rowe, P.H. and Valkova, I.V. (2002) “Comparative assessment of methods to develop QSARs for the prediction of the toxicity of phenols to Tetrahymena pyriformis”, Chemosphere 49, 1201–1221.

[10] Topliss, J.G. and Costello, J.D. (1972) “Chance correlations in structure–activity studies using multiple regression analysis”, J. Med. Chem. 15, 1066–1069.

[11] Liu, R., Sun, H. and So, S.-S. (2001) “Development of quantitative structure–property relationships models for early ADME evaluation in drug discovery 2. Blood–brain barrier penetration”, J. Chem. Inf. Comput. Sci. 41, 1623–1632.

[12] Cronin, M.T.D., Aptula, A.O., Dearden, J.C., Duffy, J.C., Netzeva, T.I., Patel, H., Rowe, P.H., Schultz, T.W., Worth, A.P., Voutzoulidis, K. and Schuurmann, G. (2002) “Structure-based classification of antibacterial activity”, J. Chem. Inf. Comput. Sci. 42, 869–878.

[13] Aptula, A.O., Netzeva, T.I., Valkova, I.V., Cronin, M.T.D., Schultz, T.W., Kuhne, R. and Schuurmann, G. (2002) “Multivariate discrimination between modes of toxic action of phenols”, Quant. Struct.–Act. Relat. 21, 12–22.

[14] Eriksson, L. and Johansson, E. (1996) “Multivariate design and modeling in QSAR”, Chemom. Intell. Lab. Syst. 34, 1–19.

[15] Brown, R.D. and Martin, Y.C. (1997) “The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding”, J. Chem. Inf. Comput. Sci. 37, 1–9.

[16] Matter, H. (1997) “Selecting optimally diverse compound from structure databases: a validation study of two-dimensional and three dimensional molecular descriptors”, J. Med. Chem. 40, 1219–1229.

[17] Schultz, T.W. (1997) “TETRATOX: The Tetrahymena pyriformis population growth impairment endpoint. A surrogate for fish lethality”, Toxicol. Methods 7, 289–309.

[18] Bradbury, S.P., Russom, C.L., Ankley, G.T., Schultz, T.W. and Walker, J.D. (2003) “QSARs for predicting ecological effects of organic chemicals”, Environ. Toxicol. Chem., (in press).

[19] SAS (Statistical Analysis System) Institute, Inc. (1989) SAS/STAT User’s Guide, 4th Ed. Vol. 2, version 6 (SAS Institute Inc.), p 846.



[20] Kennard, R.W. and Stone, L.A. (1969) “Computer aided designs of experiments”, Technometrics 11, 137–148.

[21] Snarey, M., Terrett, N.K., Willet, P. and Wilton, D.J. (1997) “Comparison of algorithms for dissimilarity-based compound selection”, J. Mol. Graphics Mod. 15, 372–385.

[22] Johnson, R.A. and Wichern, D.W. (1998) “Clustering, distance methods, and ordination”, In: Johnson, R.A. and Wichern, D.W., eds, Applied Multivariate Statistical Analysis (Prentice Hall, New Jersey), pp 726–797.

[23] Lipnick, R.L. (1991) “Outliers: their origin and use in the classification of molecular mechanisms of toxicity”, Sci. Total Environ. 109/110, 131–153.

[24] Nendza, M. and Russom, C.L. (1991) “QSAR modeling of the ERL-D fathead minnow acute toxicity database”, Xenobiotica 21, 147–170.

[25] Konemann, H. (1981) “Quantitative structure–activity relationships in fish toxicity studies. Part I: A relationship for 50 industrial pollutants”, Toxicology 19, 209–221.

[26] Veith, G.D., Call, K. and Brooke, L. (1983) “Structure–toxicity relationships for the fathead minnow, Pimephales promelas: narcotic industrial chemicals”, Can. J. Fish. Aquat. Sci. 40, 743–748.

[27] Hansch, C., Kim, D., Leo, A.J., Novellino, E., Silipo, C. and Vittoria, A. (1989) “Toward a quantitative comparative toxicology of organic compounds”, Crit. Rev. Toxicol. 19, 185–226.

[28] Schultz, T.W. and Tichy, M. (1993) “Structure–toxicity relationships for unsaturated alcohols to Tetrahymena pyriformis: C5 and C6 analogs and primary propargylic alcohols”, Bull. Environ. Contam. Toxicol. 51, 681–688.

[29] Kaiser, K.L.E., Dearden, J.C., Klein, W. and Schultz, T.W. (1999) “A note of caution to users of ECOSAR”, Water Qual. Res. J. Can. 34, 179–182.

[30] Norinder, U. and Hogberg, T. (1992) “A quantitative structure–activity relationship for some dopamine D2 antagonists of benzamide type”, Acta. Pharm. Nord. 4, 73–78.

[31] Belvisi, L., Bravi, G., Catalano, G., Mabilia, M., Salimbeni, A. and Scolastico, C. (1996) “A 3D QSAR CoMFA study of non-peptide angiotensin II receptor antagonists”, J. Comput.-Aided Mol. Des. 10, 567–582.

[32] Blaha, L., Damborsky, J. and Nemec, M. (1998) “QSAR for acute toxicity of saturated and unsaturated halogenated aliphatic compounds”, Chemosphere 36, 1345–1365.

[33] Harju, M., Andersson, P.L., Haglund, P. and Tysklind, M. (2002) “Multivariate physicochemical characterization and quantitative structure–property relationship modeling of polybrominated diphenyl ethers”, Chemosphere 47, 375–384.

[34] Golbraikh, A. and Tropsha, A. (2002) “Predictive QSAR modeling based on diversity sampling of experimental datasets for the test and training test selection”, J. Comput.-Aided Mol. Des., (in press).

[35] Golbraikh, A. and Tropsha, A. (2002) “Beware of q2!”, J. Mol. Graphics Mod. 20, 269–276.

