
ORIGINAL RESEARCH

Computational neural network analysis of the affinity of 2-pyridyl-3,5-diaryl pyrroles analogs for the human glucagon receptor using density functional theory

Mohsen Shahlaei • Zohreh Nazari

Received: 4 June 2013 / Accepted: 19 September 2013 / Published online: 2 October 2013

© Springer Science+Business Media New York 2013

Abstract In our continuing efforts to provide a predictive

quantitative structure activity relationship using different

algorithms, radial basis function neural networks (RBFNN)

have been successfully combined with principal compo-

nent analysis (PCA) and trained to predict the biological

activity (pIC50) of 2-pyridyl-3,5-diaryl pyrrole derivatives

as human glucagon receptor antagonists. A set of quantum

descriptors, including the HOMO energy, the LUMO energy, softness, and hardness, was calculated using the DFT-B3LYP method with the 6-311G basis set. An

ANN with 1-15-1 architecture was generated using eight

principal components. A principal component regression

(PCR) model was also developed for comparison. It was

found that a properly selected and trained RBFNN with a

suitable training set could represent the dependence of the

biological activity on the principal components that were

calculated using quantum descriptors fairly well. For

evaluation of the predictive ability of the developed PCA-

based RBFNN model, an optimized network was applied to

predict the pIC50s of compounds in the test set, which were

not used in the modeling phase of the procedure. A squared

correlation coefficient (R2) and root mean square error of

0.161 and 0.874 for the test set by the PCR model should

be compared with the values of 0.999 and 0.0154 by the

principal component based RBFNN model. These

improvements are due to the fact that the pIC50s of

2-pyridyl-3,5-diaryl pyrrole derivatives show non-linear

correlations with the principal component extracted from

the quantum descriptors.

Keywords Glucagon receptor inhibition activity · 2-Pyridyl-3,5-diaryl pyrrole derivatives · Radial basis function neural network · Density functional theory

Introduction

The prediction of pharmaceutical and biological activities,

physicochemical and pharmacokinetic properties/activities

of small molecules is the main goal of quantitative struc-

ture–property/activity relationships (QSPRs/QSARs)

(Cronce et al., 1998; Arkan et al., 2010; Saghaie et al.,

2010a, 2011; Shahlaei and Fassihi, 2013; Shahlaei et al.,

2010a, 2011a, c; Shahlaei and Pourhossein, 2012, 2013;

Shahlaei et al., 2010b; Shahlaie et al., 2013). A typical

QSAR model is developed on the basis of the correlation

between the experimental biological activity and structural

descriptors reflecting the molecular structure of the com-

pounds of interest. Since these structural descriptors are

determined solely by computational techniques, a priori prediction of the activities of compounds is feasible without laboratory measurements, which saves time, chemicals, space, and instrument use, and alleviates safety (toxicity) and disposal concerns. For many years, QSAR models

have been efficiently employed for the study of biological

mechanisms of various reactive compounds (Arkan et al.,

2010; Saghaie et al., 2010b, 2011, 2013; Shahlaei and

Fassihi, 2013; Shahlaei et al., 2010a, 2011a, b, c; Shahlaei

and Pourhossein, 2012, 2013; Shahlaei et al., 2010b;

Shahlaie et al., 2013). To obtain a significant correlation

M. Shahlaei (✉)

Novel Drug Delivery Research Center, School of Pharmacy,

Kermanshah University of Medical Sciences,

67346-67149 Kermanshah, Iran

e-mail: [email protected]; [email protected]

Z. Nazari

Student Research Committee, Kermanshah University of

Medical Sciences, Kermanshah, Iran


Med Chem Res (2014) 23:2046–2061

DOI 10.1007/s00044-013-0801-3

MEDICINAL CHEMISTRY RESEARCH

between independent variables (descriptors) and a dependent variable (biological activity), it is crucial that suitable

descriptors are used (Karelson et al., 1996), because the

success of a QSAR model is highly dependent on the

selection of significant descriptors. In recent years, a

number of quantum-chemical descriptors such as charges,

orbital energies, frontier orbital densities, and dipole

moment, etc. estimated from density functional theory

(DFT) calculations have been employed in a successful

manner in developing different QSAR models for pre-

dicting biological activity in terms of the structures and

physicochemical properties of compounds (Pasha et al.,

2005; Saghaie et al., 2013).

Quantum-chemical descriptors, which can be obtained by

calculation, can describe defined molecular activities, and

are not restricted to closely related structural molecules.

Therefore, in recent years, the development of QSAR

models based on quantum-chemical descriptors has gained

significant interest (Karelson et al., 1996). Recently, it has

been reported that some comparative QSAR models, using

the descriptors calculated using the DFT approach instead of

the semi-empirical techniques AM1 or PM3, can improve

the accuracy of the results and lead to more reliable QSAR

models. A QSAR work by Arulmozhiraja and Morita (2004)

which studied relationships between the various DFT-based

descriptors (chemical softness, electronegativity, and elec-

trophilicity index) and the toxicity of 33 polychlorinated

dibenzofurans (PCDFs) showed a moderate to satisfactory

success for the DFT-based reactivity descriptors in the

toxicological QSARs. Pasha et al. (2005) studied quantum-

chemical descriptors based QSAR models on toxicity of

phenol derivatives with AM1, PM3, PM5, and DFT meth-

ods, indicating that the DFT method is more reliable than

others and has an improved predictive ability.

Various statistical and mathematical approaches for

building QSAR models have been applied including mul-

tiple linear regression (MLR), principal component ana-

lysis (PCA), and partial least-squares regression (PLS)

(Fassihi et al., 2012; Shahlaei et al., 2010a, 2011b). In

addition, artificial neural networks (ANNs) have become popular because of their ability to model complex non-linear relationships between dependent and independent variables (Arkan et al., 2010; Saghaie et al.,

2010a; Shahlaei et al., 2010a; Shahlaei and Pourhossein,

2012; Shahlaei et al., 2010b). ANNs are biologically

inspired computational algorithms designed to simulate the

way in which the human brain processes input data and

extracts valuable information. ANNs do not require an explicit formulation of the mathematical or physical relationships between the dependent and independent variables of the problem at hand. This gives ANNs an advantage over other mathematical and statistical regression approaches for some chemical applications. For these reasons, various ANN algorithms have been applied in recent years to a wide variety of chemical problems. Several types of ANN algorithms have been developed so far, and new ones are invented every week (Agatonovic-Kustrin and Beresford, 2000). The behavior and response of a typical ANN algorithm are determined by the transfer functions of its neurons, by the learning rules, and by the architecture itself. A typical ANN consists of artificial neurons, or processing elements, connected by weights; these constitute the network structure and are organized in layers.

The extensive applications of ANNs in science stem

from their flexibility and power to model non-linear sys-

tems without prior knowledge of an empirical model.

The main goal of the current study is to develop a QSAR

model based on quantum descriptors using radial basis

function-ANNs for modeling and predicting the human glucagon receptor antagonist activities of 2-pyridyl-3,5-diaryl pyrrole derivatives. In the first step, a PCR model was constructed. Then, to inspect the non-linear relationship between the activity and the principal components calculated from the quantum descriptors, an ANN model was developed for predicting the pIC50 values, and the results were compared with the experimental values and with those calculated using the PCR model.

Methods

Calculation of quantum descriptors

The basic skeletons of the studied compounds and their details are summarized in Table 1. The initial structures of all

2-pyridyl-3,5-diaryl pyrrole derivatives were constructed

by the CS Chem3D software (Ultra 10.0, ChemOffice

2006, CambridgeSoft Corporation).

To save computational time, initial geometry optimiza-

tions were carried out with the molecular mechanics (MM)

method using the MM+ force field.

Various computational studies (Yan et al., 2005; Trohalaki et al., 2000; Saghaie et al., 2013) indicate that DFT (Kohn and Sham, 1965; Parr and Yang, 1989) at the B3LYP level of theory (Becke, 1993) is suitable for QSAR studies. In this study, the 2-pyridyl-3,5-diaryl pyrrole compounds were fully optimized using the B3LYP method and the 6-311G basis set. Frequency calculations show that all molecular structures are stable and correspond to minima on the potential energy surface. Quantum-chemical calculations were carried out with the Gaussian 03 program (Frisch et al., 2008) on a Linux platform.

As listed in Table 2, 18 quantum-chemical descriptors

were used to analyze their variations and efficiency of the

inhibition activity prediction of the compounds of interest.


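Table 1 The general molecular structure and substituent details of compounds used in this study

[General structure drawing with positions labeled X, Y, F, N, and Ra]

Compd   Ra        X and Y ring fragments (labels from the structure drawings)
1       S(O)Me    HN, N, NH, N
2       SMe       HN, N, NH, N
3       S(O)Me    HN, CH, NH, CH
4       S(O)Me    HC, NH, CH, NH
5       SMe       HN, CH, NH, CH
6       SMe       HC, NH, CH, NH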


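Table 1 continued

[General structure drawing: NH-containing ring with substituents R1, R2, and R3]

Compd   Substituent labels from the structure drawings (column order R3, R2, R1)
7       N, F, H
8       N, F, CH3
9       N, F
10      N, F
11      N, F, Cl
12      N, F, Cl
13      N, F, Cl, Cl
14      N, F, F
15      N, F, OCH3
16      N, F, CH3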


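Table 1 continued

Compd   Substituent labels from the structure drawings (column order R3, R2, R1)
17      N, F, CH2CH3
18      N, F, NO2
19      N, F, CO2CH2CH3
20      N, F, CN
21      N, F, NH2
22      N, F, COOH
23      N, Cl
24      N, Cl, Cl
25      N, Br, Cl
26      N, Br, Cl
27      N, Br, Cl
28      N, OCH3, Cl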


The quantum descriptors employed in this work, such as

polarizability (α), dipole moment (μ), energy of the highest

occupied molecular orbital (EHOMO), energy of the lowest

unoccupied molecular orbital (ELUMO), most negative

atomic charge, most positive charge, etc., have been

obtained directly or indirectly (Table 2) from the Gaussian

output files.

The selection of input variables for an ANN is necessary to avoid "overfitting" (Tetko et al., 1993) when many input descriptors are available. As a linear technique for dimensionality reduction, PCA transforms the input data set from its original form (points in m-dimensional space) to a new form (points in p-dimensional space), where p is less than m. During this process, most of the variability of the original input data set is retained. Because the transformed input data set has a lower dimension, a smaller ANN can be used for prediction.

Principal component regression (PCR)

Next, a PCA was performed for variable reduction and data interpretation. In PCA, descriptors that describe the same property cluster together; hence, the predicted activity can be described with a smaller number of independent variables.

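Table 1 continued

Compd   Substituent labels from the structure drawings (column order R3, R2, R1)
29      N, OCH3, Cl
30      N, H3CO, Cl
31      N, H3CH2CO, Cl
32      N, H3CH2CH2CO, Cl
33      N, O, Cl
34      N, OH2CCHH3C, H3C, Cl
35      N, OH3CH2CH2C, Br, Cl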


PCR is a standard technique among the multivariate

regression methods available for QSAR studies. In a typi-

cal PCR, a model can be explained as follows; consider the

following equation:

$$y = Xb + e \tag{1}$$

This equation describes the relationship between a set of descriptors X (regressors) and the pIC50s y (regressands) by means of a vector b. Note that the vectors y and b are column vectors. If k denotes the number of molecules used in the regression, p the number of descriptors calculated for each molecule, and y the biological activity to be regressed, then y is a vector of length k and X is a (k × p) matrix containing the calculated descriptors as row vectors. The regression vector b, of length p, has to be determined in the regression step. In this step, the Euclidean length of the error vector e is minimized by solving a least-squares problem. The idea of PCR is to decompose X into a (k × p) matrix R, whose k rows are the eigenvectors (factors) of length p, and an orthogonal (k × k) matrix C containing the scores as rows.

$$X = CR \tag{2}$$

The matrix C and the eigenvalues [λ] are given by solving the eigenvalue problem

$$C^{T} Z C = [\lambda] \tag{3}$$

and the eigenvector matrix R (k × p) is calculated by

$$R = C^{T} X. \tag{4}$$

In Eq. (3), $Z = XX^{T}$ denotes the (k × k) covariance matrix and [λ] is the diagonal matrix of eigenvalues. The rows in $C^{T}$ are the eigenvectors of Z and its columns are the "scores." The column vectors of the square matrix C are orthonormal and often called principal components (PCs).

The scores with respect to the factors are the new orthogonal variables on which the properties are now regressed instead of the original variables (the molecular descriptors). Because of their orthogonality, those PCs that are not assumed to be significant can be omitted, i.e., the calculation is usually performed with reduced matrices ($C_r$ and $R_r$) for the calibration. Here r denotes the number of factors included in the model. The regression vector with respect to the original variables (theoretical molecular descriptors) can be obtained as follows:

$$\hat{b} = R_r^{+} C_r^{T} y,$$

where $R_r^{+}$ denotes the pseudoinverse of $R_r$.
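The expression for the regression vector follows from the decomposition in a few lines. The short derivation below is a sketch under the assumption that the reduced score matrix has orthonormal columns, so that $C_r^{T} C_r = I_r$, and that X is approximated by the product of the reduced matrices; the identity symbol $I_r$ is introduced here for illustration.

```latex
\begin{aligned}
y &\approx X b \approx C_r R_r b \\
C_r^{T} y &\approx C_r^{T} C_r R_r b = R_r b
  \qquad \text{(using } C_r^{T} C_r = I_r \text{)} \\
\hat{b} &= R_r^{+} C_r^{T} y
\end{aligned}
```

The last step takes the minimum-norm least-squares solution of $R_r b = C_r^{T} y$, which is why the pseudoinverse of $R_r$ appears rather than an ordinary inverse.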

The greatest amount of variability of the original quantum-electronic descriptor data set is represented by the first principal component (PC), and the second principal component explains the maximum variance of the residual data set. The third one then describes the most important variability of the next residual data set, and so on. According to the theory of least squares, the eigenvectors of all principal components are orthogonal to each other in the multidimensional data space defined by the quantum-electronic descriptors. Generally speaking, only p principal components are enough to account for most of the variance in an m-dimensional data set, where p is the number of important principal components and m is the total number of principal components in the data set. Since p is less than m, PCA is generally regarded as a data reduction technique. That is to say, PCA can project a multidimensional data set onto a lower-dimensional data space without losing most of the information in the original data set.

Splitting PCs matrix into training and test sets

At the next step of developing QSAR models, and in order

to develop a reliable (validated) QSAR model, consecutive

molecules are selected and put alternatively in the training

and test sets. The division of an original dataset into the

training and test sets can be carried out using various

algorithms. One of the most popular and successful algorithms is the Kennard–Stone algorithm.

Table 2 Quantum-electronic descriptors used in this study

Descriptor abbreviation   Descriptor definition
EHOMO                     The energy of the highest occupied molecular orbital
ELUMO                     The energy of the lowest unoccupied molecular orbital
H–L                       The HOMO–LUMO energy gap
Electronegativity         $\chi = (E_{HOMO} - E_{LUMO})/2$
Hardness                  $\eta = -(E_{HOMO} - E_{LUMO})/2$
Electrophilicity          $\omega = \chi^{2}/(2\eta)$
MPC                       The most positive charge
LNC                       The least negative charge
SSC                       The sum of square of charges
SSPC                      The sum of square of positive charges
SSNC                      The sum of square of negative charges
SPC                       The sum of positive charges
SNC                       The sum of negative charges
SAC                       The sum of absolute charge
DMx                       Dipole moment in x direction
DMy                       Dipole moment in y direction
DMz                       Dipole moment in z direction
TDM                       Total dipole moment


The Kennard–Stone method selects the molecules which

are furthest from each other in the dataset one by one (Kennard

and Stone, 1969). The quantity employed to measure the

distance is the Euclidean distance. For a response matrix

with N molecules (rows) and K PCs (columns), the multi-

variate Euclidean distance between samples i and j is

$$D_{ij} = \lVert x_i - x_j \rVert = \sqrt{\sum_{v=1}^{K} (x_{iv} - x_{jv})^{2}}$$

The first step is to select the two furthest molecules

(maximum Dij). The third molecule is picked out by

performing the following steps: the distance between each

molecule and the two furthest molecules are calculated; the

shortest of each of these pairs of distances is selected and

the molecule with the maximum value in this set of

minimal distances is chosen.

Generalizing, if M molecules from the original dataset with N molecules have been selected, the next molecule M + 1 is chosen by calculating:

$$d_i(M) = \min\{D_{i1}, D_{i2}, \ldots, D_{iM}\}$$

for the N - M molecules that have not been picked out previously. Of these, the one which complies with the following equation is selected:

$$d_i(M + 1) = \max\{d_i(M)\}$$
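The selection rule described above can be sketched in a few lines of Python. This is an illustrative implementation of the max-min procedure only, not the script used in the study; the function names and the example score vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone(samples, n_select):
    """Return the indices of n_select samples chosen by the max-min rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(samples)")
    # Step 1: start from the two samples that are furthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: euclidean(samples[pair[0]], samples[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Step 2: repeatedly add the sample whose nearest selected
    # neighbour is furthest away (maximum of the minimal distances).
    while len(selected) < n_select:
        best = max(
            remaining,
            key=lambda i: min(euclidean(samples[i], samples[s]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical PC scores (two components per molecule), for illustration only.
scores = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
training_idx = kennard_stone(scores, 3)
```

In this toy example the two extreme points are picked first and the midpoint third, which shows how the procedure spreads the selected molecules over the score space.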

Radial basis function neural networks (RBFNNs)

In the present study, one type of neural networks, namely

RBFNN, was employed to establish an alternative non-

linear model. The theory of RBFNN has been described in detail elsewhere (Xiang et al., 2002), so we

will limit ourselves to a brief outline highlighting only the

most important aspects.

Usually, RBFNN comprises three layers, i.e., the input

layer, the hidden layer, and the output layer (Fig. 1). The

input layer does not process the information, since it only

distributes the input vectors to the hidden layer, whereas

the latter consists of a number of RBF units (nh) and biases

(bk). Each neuron on the hidden layer employs a radial

basis function as a non-linear transformation function to

operate on the input data. The frequently used RBF is a

Gaussian function that is characterized by a center and a

width. This function measures the Euclidean distance

between the input vector (X) and the radial basis function

center (cj) and performs the non-linear transformation

within the hidden layer as follows:

$$h_j = \exp\left(-\lVert X - c_j \rVert^{2} / r_j^{2}\right),$$

where $h_j$ denotes the output of the jth RBF unit, while $c_j$ and $r_j$ are the center and width of that unit, respectively.

The operation of the output layer is linear and is given by:

$$y_k(X) = \sum_{j=1}^{n_h} w_{kj} h_j(X) + b_k,$$

where $y_k$ is the kth output unit for the input vector X, $w_{kj}$ is the weight connection between the kth output unit and the jth hidden layer unit, and $b_k$ is the respective bias.
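The forward pass defined by the hidden-layer and output-layer equations can be illustrated with a minimal Python sketch. It assumes a single output unit and one shared width for all Gaussian units; the centers, width, weights, and bias below are hypothetical values, not the fitted parameters of this study.

```python
import math


def rbf_hidden(x, centers, width):
    """Gaussian activations h_j = exp(-||x - c_j||^2 / r^2) with a shared width r."""
    return [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / width ** 2)
        for c in centers
    ]


def rbf_output(x, centers, width, weights, bias):
    """Single linear output unit: y = sum_j w_j * h_j(x) + b."""
    h = rbf_hidden(x, centers, width)
    return sum(w * hj for w, hj in zip(weights, h)) + bias


# Hypothetical network parameters, for illustration only.
centers = [(0.0, 0.0), (1.0, 1.0)]
y = rbf_output((0.0, 0.0), centers, width=1.0, weights=[2.0, 0.5], bias=6.0)
```

An input that coincides with a center gives that unit an activation of exactly 1, while the contribution of the other unit decays with the squared distance.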

From the two equations above, one can see that the design of an RBFNN involves selecting the centers, the number of hidden layer units, the widths, and the weights. There are various ways of selecting the centers, such as random subset selection, k-means clustering, and RBF–PLS. In this study, a forward subset selection routine was used to select the centers from the training set samples. The widths of the radial basis functions can either be chosen equal for all the units or different for each unit. Here, we limited ourselves to Gaussian functions with a constant width for all the units. Furthermore, the connection weights between the hidden layer and the output layer were adjusted using a least-squares solution after the selection of the RBF centers and width. All RBFNN calculations were performed with home-developed scripts in the MATLAB package (www.mathworks.com/products/matlab/).

Moreover, the overall performance of the final RBFNN

model was evaluated in terms of its root mean squared

error (RMSE), and its goodness and robustness estimated

by the same statistical parameters as those used for the

linear model.

Validation and evaluation

Testing the stability, predictive power, and generalization

ability of the models is a very important step in QSAR

Fig. 1 The typical architecture of the RBF-ANN


study. As for the validation of predictive power of a QSAR

model, two basic principles (internal validation and exter-

nal validation) are available.

In both validation methods, R2, which represents the explained variance for a given set, was used to determine the goodness of fit of the model. In addition, the prediction performance of the built models must be estimated

in order to build a successful QSAR model. In this study,

the prediction performance of the developed models was

evaluated using two parameters, the RMSE and percent

relative standard error [RSEP (%)].
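The two error measures can be computed as in the following sketch. The RSEP formula used here, 100 times the square root of the summed squared errors divided by the summed squared observed values, is the form commonly used in chemometrics and is an assumption, since the text does not state the formula explicitly. The activity values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted activities."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def rsep_percent(observed, predicted):
    """Relative standard error of prediction in percent (assumed definition)."""
    num = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    den = sum(o ** 2 for o in observed)
    return 100.0 * math.sqrt(num / den)


# Hypothetical pIC50 values, for illustration only.
obs = [6.0, 7.0, 8.0]
pred = [6.1, 6.9, 8.2]
rmse_value = rmse(obs, pred)
rsep_value = rsep_percent(obs, pred)
```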

The cross validation is one of the most popular methods

for internal validation. In this study, the internal predictive

capability of the model was evaluated by leave-one-out

cross validation (Q2LOO). A good Q2LOO often indicates

a good robustness and high internal predictive power of a

QSAR model. However, recent studies of Tropsha and co-

workers [16] indicate that there is no evident correlation

between the value of Q2LOO and actual predictive power

of a QSAR model, revealing that the Q2LOO is still

inadequate for a reliable estimate of model’s predictive

ability for all new compounds. In order to determine both

the generalizability of QSAR models for new compounds

and the true predictive ability of the models, the statistical

external validation can be used at the model development

step by properly employing a prediction set for validation.

The results of data splitting using Kennard-Stone algorithm

are shown in Table 1, as the test set is indicated with an

asterisk.

Tropsha also suggested a set of criteria; if these criteria are met, the model can be considered predictive [43]. These criteria include:

$$R^{2}_{LOO} > 0.5$$

$$R^{2} > 0.6$$

$$\frac{R^{2} - R_{0}^{2}}{R^{2}} < 0.1$$

$$\frac{R^{2} - R_{0}^{\prime 2}}{R^{2}} < 0.1$$

$$0.85 < k < 1.15 \quad \text{or} \quad 0.85 < k' < 1.15$$

$R^{2}$ is the correlation coefficient of regression between the predicted and observed activities of compounds in the training and test sets. $R_{0}^{2}$ is the correlation coefficient for the regression of predicted versus observed activities through the origin, $R_{0}^{\prime 2}$ is the correlation coefficient for the regression of observed versus predicted activities through the origin, and the slopes of the regression lines through the origin are denoted by k and k', respectively. Detailed definitions of $R_{0}^{2}$, $R_{0}^{\prime 2}$, k, and k' are given in the literature and are not repeated here for brevity [43].
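Once the individual statistics are available, checking the criteria is a matter of simple comparisons. The sketch below takes the statistics as precomputed inputs and does not derive them. It treats the two ratio conditions as both required, which is the stricter reading of the list and is an assumption; the function and argument names are hypothetical.

```python
def passes_tropsha(r2_loo, r2, r0_sq, r0_prime_sq, k, k_prime):
    """Return True when the listed acceptance criteria are all satisfied."""
    # Cross-validated and conventional R^2 thresholds.
    if r2_loo <= 0.5 or r2 <= 0.6:
        return False
    # Relative differences between R^2 and the through-origin values
    # (both required here, an assumed strict reading).
    ratios_ok = (r2 - r0_sq) / r2 < 0.1 and (r2 - r0_prime_sq) / r2 < 0.1
    # At least one through-origin slope must be close to 1.
    slopes_ok = 0.85 < k < 1.15 or 0.85 < k_prime < 1.15
    return ratios_ok and slopes_ok


# Hypothetical statistics, for illustration only.
accepted = passes_tropsha(0.7, 0.9, 0.89, 0.88, 1.0, 1.0)
```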

In addition, according to Roy and Roy [44], the difference between the values of $R_{0}^{2}$ and $R_{0}^{\prime 2}$ must be studied and given importance. They suggested the following modified $R^{2}$ form:

$$R_{m}^{2} = R^{2}\left(1 - \sqrt{\left|R^{2} - R_{0}^{2}\right|}\right)$$

An $R_{m}^{2}$ value greater than 0.5 for a given model indicates good external predictability of the developed model.
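As a small worked illustration of the modified R2 statistic, the following sketch evaluates it for given values of R2 and the through-origin R2. The input values are hypothetical.

```python
import math


def rm_squared(r2, r0_sq):
    """Modified R^2: r2 * (1 - sqrt(|r2 - r0_sq|))."""
    return r2 * (1.0 - math.sqrt(abs(r2 - r0_sq)))


# Hypothetical values: R^2 = 0.90 and through-origin R^2 = 0.86.
rm2 = rm_squared(0.90, 0.86)
```

With these inputs the square root term is 0.2, so the statistic is 0.9 × 0.8 = 0.72, which is above the 0.5 threshold.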

The actual predictability of each model developed on the training set is confirmed on an external test set [43] and is calculated from $R_{p}^{2} = 1 - \mathrm{PRESS}/\mathrm{SD}$, where PRESS is the sum of squared differences between the measured activity and the predicted value for each compound in the test set, and SD is the sum of squared deviations between the measured activity for each molecule in the test set and the mean measured value of the training set.
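The external predictive R2 can be computed directly from its definition, as in the sketch below. The test-set activities and the training-set mean are hypothetical values chosen for illustration.

```python
def r2_pred(test_observed, test_predicted, train_mean):
    """Predictive R^2 = 1 - PRESS / SD for an external test set."""
    # PRESS: squared differences between measured and predicted test activities.
    press = sum((o - p) ** 2 for o, p in zip(test_observed, test_predicted))
    # SD: squared deviations of measured test activities from the training mean.
    sd = sum((o - train_mean) ** 2 for o in test_observed)
    return 1.0 - press / sd


# Hypothetical test-set pIC50 values and training-set mean.
r2p = r2_pred([6.0, 7.0, 8.0], [6.1, 6.9, 8.2], train_mean=7.0)
```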

Developed models are also tested for reliability and robustness by Y-randomization testing: new models are recalculated for randomly reordered responses. We provide evidence that the proposed models are well founded, and not just the result of chance correlation, by obtaining new models on the randomized responses with significantly lower $R^{2}$ than the original models. If the randomized models instead show high $R^{2}$, an acceptable QSAR model cannot be obtained for the data set.
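The idea of Y-randomization can be sketched with a one-variable linear model, for which R2 equals the squared Pearson correlation. This simplification, the synthetic data, the number of trials, and the seed are all assumptions made for illustration; the study itself applies the test to the PCR and RBFNN models.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-variable least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def y_randomization(x, y, n_trials=50, seed=0):
    """R^2 values obtained after randomly reordering the response vector."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        shuffled = list(y)
        rng.shuffle(shuffled)
        results.append(r_squared(x, shuffled))
    return results


# Synthetic, perfectly linear data, for illustration only.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
r2_original = r_squared(x, y)
r2_random = y_randomization(x, y)
```

A sound model shows an original R2 well above every value in the randomized list; comparable values would point to chance correlation.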

Applicability domain of the model

The presence of outliers was confirmed by the Williams

plot. This plot includes response outliers, i.e., molecules

with standardized residuals greater than two standard

deviation units, and structurally influential compounds in

determining figures of merit and statistical parameters of

the developed model, i.e., molecules with high leverage

value (h) greater than the warning leverage limit, h* = 3k'/n, where k' is the number of model variables plus one and n is the

number of the molecules applied in model development

[22].

In other words, the applicability domain of a QSAR model must be defined, and only the activity predictions for compounds that fall into this domain may be considered reliable [21]. Such QSAR models can be used for screening new compounds. The Williams plot can be used for an immediate and simple graphical detection of both the response outliers and the structurally influential compounds in a model, i.e., those with h > h*. Compounds with h > h* strongly affect the goodness of fit of the developed model, but they may not be outliers because of their low residuals. It must be noted that compounds with a high leverage value and a good fit in the developed model can stabilize the model. On the other hand, compounds with a bad fit in the developed model may be outliers. Thus, a combination of the leverage and the standardized residuals can be used to define the applicability domain.
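The leverage part of the Williams plot can be illustrated for the simplest case of a model with one variable and an intercept, where the leverage has the closed form h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2. The restriction to a single variable and the example values are assumptions made to keep the sketch short; a multivariate model would use the diagonal of the hat matrix instead.

```python
def leverages(x):
    """Leverage of each sample for a one-variable model with intercept."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1.0 / n + (xi - mean) ** 2 / sxx for xi in x]


def influential(x, n_variables=1):
    """Indices of samples whose leverage exceeds h* = 3k'/n, with k' = variables + 1."""
    h_star = 3.0 * (n_variables + 1) / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > h_star]


# Hypothetical descriptor values; the last sample lies far from the rest.
x = [0.1, 0.2, 0.3, 0.2, 0.1, 0.3, 0.2, 0.1, 0.3, 5.0]
flagged = influential(x)
```

Only the isolated last sample exceeds the warning limit of 0.6 in this example; whether it is also a response outlier would depend on its standardized residual.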

Results and discussion

Interpretation of quantum descriptors

Many molecular properties depend on intermolecular

interactions. The main component of these interactions is

electrostatic in its nature. Electrical charges in the molecule

are simply the driving force of electrostatic interactions.

Charge-based descriptors have therefore been widely used

as chemical reactivity indices or as measures of weak

intermolecular interactions. The charge distributions in a

given molecule and partial charges on the atoms can be

estimated using quantum-chemical calculations. One of the

most important parts of standard output of almost any

quantum calculation is the Mulliken atomic charges

(Mulliken, 1955a, b, c, d). Usually, the minimum (most

negative) and maximum (most positive) atomic partial

charges in the molecule or the minimum or maximum

partial charges for particular types of atoms are employed

as quantum descriptors (Clare and Supuran, 1994; Cartier

and Rivail, 1987). Different sums of absolute or square

values of partial charges (for example sum of positive

charges and sum of square of positive charges) have also

been employed to explain intermolecular interactions.

Other usual charge-based quantum-chemical indices used

as electrostatic descriptors in QSAR models are the aver-

age absolute atomic charge (Clare and Supuran, 1994;

Ordorica et al., 1993) and a polarity parameter defined as

the difference between the values of the most positive and

negative charge (Clare and Supuran, 1994; Cartier and

Rivail, 1987; Clare, 1995).
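Several of the charge-based descriptors listed in Table 2 are simple functions of the atomic partial charges. The sketch below computes them from a plain list of charges; the charge values are hypothetical and would in practice be read from the Mulliken population analysis in the quantum-chemical output.

```python
def charge_descriptors(charges):
    """Charge-based descriptors computed from a list of atomic partial charges."""
    positive = [q for q in charges if q > 0]
    negative = [q for q in charges if q < 0]
    return {
        "MPC": max(charges),                      # most positive charge
        "SSC": sum(q * q for q in charges),       # sum of squared charges
        "SSPC": sum(q * q for q in positive),     # sum of squared positive charges
        "SSNC": sum(q * q for q in negative),     # sum of squared negative charges
        "SPC": sum(positive),                     # sum of positive charges
        "SNC": sum(negative),                     # sum of negative charges
        "SAC": sum(abs(q) for q in charges),      # sum of absolute charges
    }


# Hypothetical partial charges for a five-atom fragment.
descriptors = charge_descriptors([0.5, -0.25, -0.25, 0.25, -0.25])
```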

Electrostatic interactions can also be denoted by the

respective electrical moments and their components. The

polarity is denoted by the dipole moment (μ). The polarization of a molecule by an external electric field can be

defined in terms of nth order susceptibility tensors of the

molecule (Sotomatsu et al., 1989). The first order term that

is referred to as the polarizability of the molecule repre-

sents the relative susceptibility of the electron cloud of an

atom or a molecule to be distorted from its normal shape by

the presence of an external field.

Because of this distortion, an induced electric dipole

moment appears. Polarizability (α) is a tensor relating the induced dipole moment (μ_ind) to the applied electric field

strength. The non-diagonal elements of the tensor represent

the polarizability of the electrons along one of the axes of

the coordinate system due to a component of the applied

electric field along another of the coordinate axes. As this

effect is insignificant compared to the polarizability in the

direction of the applied electric field, the non-diagonal

elements of the polarizability tensor are zero or very small

compared to the diagonal elements. The polarizability is

therefore represented in practice as ‘‘mean polarizability,’’

i.e., the average polarizability over the three axes of the

molecule, and equals one-third of the trace. It has been shown that α is related to the molecular volume (Lewis

et al., 1994), hydrophobicity (Breneman and Rhem, 1997),

and the electrophilic superdelocalizability (Clare and

Supuran, 1998).

According to classical chemical theory, all chemical

interactions are by nature either electrostatic (polar) or

orbital (covalent) driven. In quantum chemistry, covalent

interactions arise from orbital overlap. The interaction of two

orbitals depends on their energy eigenvalues. Consequently,

energies associated with the highest occupied molecular

orbital (EHOMO) and the lowest unoccupied molecular orbital

(ELUMO) are often good candidates for 2-dimensional

descriptors. For example, EHOMO might model the covalent

basicity of a hydrogen bond acceptor or the ELUMO might

model the covalent acidity of the proton of an H bond donor.

Further interpretation is possible because the HOMO energy

is related to the ionization potential and is a measure of the

molecule’s tendency to be attacked by electrophiles. Cor-

respondingly, the LUMO energy is related to the electron

affinity and is a measure of a molecule’s tendency to be

attacked by nucleophiles (Tuppurainen et al., 1991). Fur-

thermore, according to frontier molecular orbital theory,

transition state formation involves the interaction between

the frontier orbitals of reacting species. The HOMO–LUMO

gap, i.e., the difference between the EHOMO and the ELUMO, is

an important stability index (Lewis et al., 1994).

A large HOMO–LUMO gap implies high stability for

the molecule in the sense of its lower reactivity in chemical

reactions. The concept of chemical hardness has been

derived from the basis of the HOMO–LUMO energy gap

(Klopman and Iroff, 2004).
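The relationship between the frontier orbital energies, the gap, and the hardness can be shown with a short sketch. It assumes the gap is taken as ELUMO minus EHOMO (a positive quantity for a stable closed-shell molecule) and the hardness as half of that gap; the orbital energies are hypothetical.

```python
def homo_lumo_gap(e_homo, e_lumo):
    """HOMO-LUMO gap, taken here as E_LUMO - E_HOMO (assumed sign convention)."""
    return e_lumo - e_homo


def hardness(e_homo, e_lumo):
    """Chemical hardness as half of the HOMO-LUMO gap."""
    return (e_lumo - e_homo) / 2.0


# Hypothetical frontier orbital energies in eV.
gap = homo_lumo_gap(-6.0, -1.5)
eta = hardness(-6.0, -1.5)
```

A larger gap gives a larger hardness, which matches the interpretation of a hard molecule as a less reactive one.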

PCR

Eighteen quantum descriptors were calculated for each molecule

studied. All the descriptors representing the electrostatic

potential together with all frontier orbital descriptors used

in this work are listed in Table 2.

In order to get the linear relationship with independent

variables, logarithms of the inverse of biological activity

(Log 1/IC50) data of 35 molecules were used.
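The transformation of the activity data is a one-line calculation. The sketch below assumes that IC50 is expressed in mol/L, which is not stated in the text; a different unit would shift all pIC50 values by a constant.

```python
import math


def pic50(ic50_molar):
    """pIC50 = log10(1 / IC50) = -log10(IC50), with IC50 assumed to be in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


# Hypothetical example: an IC50 of 1 micromolar.
value = pic50(1e-6)
```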

PCA is a multivariate technique that, in QSAR, analyzes a data matrix in which molecules are described by several inter-correlated quantitative descriptors. Its goal is to extract the important information from the matrix, to represent it as a set of new orthogonal variables called principal components, and to display the pattern of similarity of the observations and of the variables as points in maps.

PCA was performed on the calculated quantum

descriptors. All the calculated PCs with their eigenvalues

are shown in Table 3. In this Table, the eigenvalues, the

percentage of variances explained by each eigenvalue, and

the cumulative percentage of variances are represented.

Therefore, we restricted the next studies to PCs and

selection of best subset of these PCs to perform linear and

non-linear regression methods.
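The percentages in Table 3 follow directly from the eigenvalues. The sketch below recomputes them from the tabulated eigenvalues; the helper names and the 95 % threshold are hypothetical and serve only to illustrate the arithmetic, since the study selects PCs by regression rather than by a variance cutoff.

```python
def explained_variance(eigenvalues):
    """Percent and cumulative percent of variance explained by each PC."""
    total = sum(eigenvalues)
    percent = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in percent:
        running += p
        cumulative.append(running)
    return percent, cumulative


def components_needed(cumulative, threshold):
    """Smallest number of leading PCs whose cumulative variance reaches threshold."""
    for count, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return count
    return len(cumulative)


# Eigenvalues as listed in Table 3 (18 principal components).
TABLE3_EIGENVALUES = [
    6.890, 5.755, 1.574, 1.249, 1.107, 0.689, 0.452, 0.150, 0.058,
    0.039, 0.031, 0.005, 0.001, 0.000, 0.000, 0.000, 0.000, 0.000,
]
percent, cumulative = explained_variance(TABLE3_EIGENVALUES)
n_95 = components_needed(cumulative, 95.0)
```

The eigenvalues sum to 18, the number of descriptors, so each percentage is the eigenvalue divided by 18; small differences in the third decimal place relative to the table come from rounding of the tabulated eigenvalues.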

Figure 2 shows how the 35 molecules are distributed in

the space of the first two principal components of the

quantum descriptor matrix. These two components retain

70 % of the variance. As can be seen in this figure, four molecules (nos. 3, 6, 20, and 35) differ markedly from the others in their quantum properties. In practice, real data often contain some outliers, and they are usually not easy to separate from the data set. Identifying outliers in the original QSAR data sets is important to ensure model quality. These molecules (as potential outliers) were retained in the dataset and were investigated further when determining the applicability domain of the models developed in this study.

In the first model, PCR, a multivariate projection method, was used to construct a relationship between the quantum descriptors and the pIC50 values of the compounds of interest. In a typical PCR procedure, PCA is followed by an MLR step between the Y matrix (pIC50 values) and the principal components of the X matrix (quantum descriptors).

Using the above procedure and factor scores as the

predictor parameters, the following equation was obtained:

\[
\mathrm{pIC}_{50} = 6.414\,(\pm 0.094) - 1.596\,(\pm 0.368)\,\mathrm{PC9} + 0.382\,(\pm 0.106)\,\mathrm{PC6} + 0.949\,(\pm 0.445)\,\mathrm{PC10}
\]
\[
N = 28, \qquad R^2 = 0.604
\]
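The two-step procedure (PCA followed by MLR on selected scores) can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: the descriptors are autoscaled, the function names are hypothetical, and the data are random placeholders. The zero-based indices 8, 5 and 9 stand for PC9, PC6 and PC10.

```python
import numpy as np


def fit_pcr(X, y, pc_indices):
    """PCA on autoscaled X, then least-squares regression of y on the chosen PC scores."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]                  # PC1 = largest eigenvalue
    loadings = eigvecs[:, order][:, pc_indices]
    T = Z @ loadings                                   # scores of the selected PCs
    A = np.column_stack([np.ones(len(y)), T])          # intercept + scores
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mean": mean, "std": std, "loadings": loadings, "coef": coef}


def predict_pcr(model, X):
    """Project new descriptors with the training scaling and loadings, then apply the MLR."""
    Z = (np.asarray(X, dtype=float) - model["mean"]) / model["std"]
    T = Z @ model["loadings"]
    return model["coef"][0] + T @ model["coef"][1:]


# Placeholder data: 28 training "molecules" x 18 "descriptors" (not the paper's values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(28, 18))
y_demo = 6.4 + rng.normal(scale=0.5, size=28)

model = fit_pcr(X_demo, y_demo, pc_indices=[8, 5, 9])
y_fit = predict_pcr(model, X_demo)
print(np.round(model["coef"], 3))
```

Because the scores are mean-centered, the fitted intercept in this sketch equals the mean activity of the training set.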

For evaluation of the predictive power of the generated PCR model, the developed model was applied to predict the pIC50 values of all compounds in the calibration and prediction sets. The pIC50 calculated by the model for each molecule is summarized in Table 4. Experimental versus predicted pIC50 values for the training and test sets, obtained by PCR modeling, are shown graphically in Fig. 3a.

Table 3 The result of principal component analysis applied to the calculated quantum descriptors

No. of PCs   Eigenvalue   Variance explained by each PC (%)   Cumulative variance explained (%)
1            6.890        38.279                              38.279
2            5.755        31.973                              70.253
3            1.574        8.747                               79.000
4            1.249        6.938                               85.937
5            1.107        6.151                               92.088
6            0.689        3.825                               95.914
7            0.452        2.512                               98.426
8            0.150        0.834                               99.260
9            0.058        0.322                               99.582
10           0.039        0.214                               99.796
11           0.031        0.170                               99.966
12           0.005        0.028                               99.994
13           0.001        0.006                               100.000
14           0.000        0.000                               100.000
15           0.000        0.000                               100.000
16           0.000        0.000                               100.000
17           0.000        0.000                               100.000
18           0.000        0.000                               100.000

Fig. 2 Plot of PC1 versus PC2 of calculated quantum descriptors for training and test sets

This model was validated using statistical parameters such as PRESS and RMSE, as well as the Tropsha criteria; the results are reported in Table 5. On the basis of these statistical parameters and the Tropsha criteria, this model must clearly be rejected. In view of these results, we decided to try a non-linear regression method, the radial basis function artificial neural network, to obtain a robust and predictive model able to describe the relationship between the structure and the human glucagon receptor inhibitory activity of the studied compounds.
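The test-set PRESS and RMSE of the PCR model can be recomputed from the tabulated activities. The sketch below uses the seven test-set rows of Table 4 (experimental and PCR-predicted pIC50). The divisor N - p - 1 with p = 3 model terms is an assumption; it is the choice that reproduces the tabulated RMSE of 0.874, whereas dividing by N gives about 0.572.

```python
import math

# Test-set rows of Table 4: molecule no. -> (experimental pIC50, PCR-predicted pIC50).
pcr_test = {
    11: (5.853, 6.698), 13: (6.376, 6.482), 15: (5.552, 5.963),
    22: (7.045, 6.230), 25: (6.522, 6.555), 26: (7.096, 6.808),
    30: (5.576, 6.383),
}


def press(pairs):
    """Sum of squared differences between observed and predicted values."""
    return sum((obs - pred) ** 2 for obs, pred in pairs)


def rmse(pairs, dof):
    """Root mean square error with an explicit divisor (degrees of freedom)."""
    return math.sqrt(press(pairs) / dof)


pairs = list(pcr_test.values())
press_test = press(pairs)
rmse_n = rmse(pairs, dof=len(pairs))             # divisor N = 7
rmse_dof = rmse(pairs, dof=len(pairs) - 3 - 1)   # assumed divisor N - p - 1 = 3

print(f"PRESS = {press_test:.3f}, RMSE(N) = {rmse_n:.3f}, RMSE(N-p-1) = {rmse_dof:.3f}")
```

The recomputed PRESS agrees with the value of 2.294 listed for the PCR test set.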

RBFNN

Another way to relate the biological activity to the PCs is non-linear modeling, using the PCs as input and an ANN as the regression tool.

In the model-building step, an RBFNN was constructed to relate the PCs to the pIC50. This model is hereafter called PCA-RBFNN.

Table 4 The experimental activity of the compounds used in this study and their predicted values by PCR and ANN

Molecule no.   Experimental activity   Predicted activity by PCR   Predicted activity by PCA-RBFNN
1              7.431                   7.330                       7.431
2              6.931                   7.164                       6.931
3              8.275                   7.620                       8.275
4              6.173                   6.363                       6.173
5              7.356                   6.625                       7.356
6              5.910                   6.385                       5.910
7              6.795                   6.682                       6.795
8              6.494                   6.412                       6.494
9              6.744                   6.111                       6.744
10             6.096                   6.651                       6.096
11a            5.853                   6.698                       5.855
12             6.376                   6.054                       6.376
13a            6.376                   6.482                       6.376
14             6.468                   6.723                       6.468
15a            5.552                   5.963                       5.552
16             5.772                   6.250                       5.772
17             5.866                   6.184                       5.866
18             7.301                   6.537                       7.301
19             6.585                   6.490                       6.585
20             7.301                   6.527                       7.301
21             6.721                   6.510                       6.721
22a            7.045                   6.230                       7.045
23             6.677                   6.818                       6.677
24             6.795                   6.588                       6.795
25a            6.522                   6.555                       6.522
26a            7.096                   6.808                       7.096
27             6.000                   7.030                       6.000
28             5.939                   6.415                       5.939
29             6.376                   6.232                       6.376
30a            5.576                   6.383                       5.576
31             4.847                   4.803                       5.506
32             5.546                   5.130                       5.546
33             5.324                   5.881                       5.433
34             5.424                   6.014                       5.424
35             5.841                   5.835                       5.373

a Molecules in test set assigned by Kennard and Stone algorithm

Fig. 3 Plot of predicted activity against the corresponding experimental activity for: a PCR and b PCA-RBFNN models

The inputs of the network were the eigenvalue-ranked PCs. The number of PCs entering the neural network was varied from 1 to 18, and 8 PCs were selected as the network input. This number of PCs gave the best results on the basis of the lowest root mean square error for the training set (RMSEC) and the lowest root mean square error of cross validation (RMSECV) in the network output (Fig. 4).
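The idea of choosing the number of PCs by cross-validated error can be sketched as follows. For brevity, a linear least-squares model stands in for the neural network, the scores are random placeholders, and the PCA is not repeated inside each fold; all three are simplifying assumptions of this illustration.

```python
import numpy as np


def loo_rmsecv(T, y, n_pcs):
    """Leave-one-out RMSECV of a linear model built on the first n_pcs score columns."""
    T = np.asarray(T, dtype=float)[:, :n_pcs]
    y = np.asarray(y, dtype=float)
    n = len(y)
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i                        # leave molecule i out
        A = np.column_stack([np.ones(n - 1), T[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        pred = coef[0] + T[i] @ coef[1:]
        squared_errors.append((y[i] - pred) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))


# Placeholder scores: the response depends on the first three columns only.
rng = np.random.default_rng(0)
T_demo = rng.normal(size=(28, 18))
y_demo = (6.4 + 1.0 * T_demo[:, 0] + 0.5 * T_demo[:, 1] - 0.8 * T_demo[:, 2]
          + rng.normal(scale=0.01, size=28))

rmsecv = {k: loo_rmsecv(T_demo, y_demo, k) for k in range(1, 11)}
best_k = min(rmsecv, key=rmsecv.get)
print(best_k, round(rmsecv[best_k], 4))
```

Plotting `rmsecv` against the number of PCs gives a curve of the kind used to pick the input dimension.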

For the PCA-RBFNN model, the "spread" and the number of radial basis functions (the hidden layer units) are the two important parameters influencing the performance of the network. A robust model is attained by selecting the parameters that give the lowest error. A response surface approach was used to optimize these parameters. The surface plot of RMSECV as a function of the spread and the number of nodes in the hidden layer is shown in Fig. 5. The results indicate that a spread of 0.9 and 15 nodes in the hidden layer gave the optimum PCA-RBFNN performance.
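A minimal radial basis function network and a small grid search over the two parameters are sketched below. Several choices here are assumptions and not the authors' implementation: Gaussian units of the form exp(-d^2 / (2 spread^2)), centers drawn at random from the training points, output weights obtained by linear least squares, leave-one-out error as the selection criterion, and random placeholder data with 28 rows and 8 inputs.

```python
import numpy as np


def gaussian_design(X, centers, spread):
    """Hidden-layer outputs (one Gaussian per center) plus a constant bias column."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    hidden = np.exp(-sq_dist / (2.0 * spread ** 2))
    return np.column_stack([hidden, np.ones(len(X))])


def fit_rbfnn(X, y, n_hidden, spread, seed=0):
    """Pick n_hidden training points as centers, then solve the linear output layer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_hidden, replace=False)]
    weights, *_ = np.linalg.lstsq(gaussian_design(X, centers, spread), y, rcond=None)
    return centers, spread, weights


def predict_rbfnn(model, X):
    centers, spread, weights = model
    return gaussian_design(X, centers, spread) @ weights


def loo_rmse(X, y, n_hidden, spread):
    """Leave-one-out cross-validated RMSE for one (n_hidden, spread) setting."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit_rbfnn(X[keep], y[keep], n_hidden, spread)
        errors.append(y[i] - predict_rbfnn(model, X[i:i + 1])[0])
    return float(np.sqrt(np.mean(np.square(errors))))


# Placeholder inputs (28 "molecules" x 8 "PCs") and a non-linear placeholder response.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(28, 8))
y_demo = 6.4 + np.tanh(X_demo[:, 0]) - 0.5 * X_demo[:, 1] ** 2

grid = {(s, h): loo_rmse(X_demo, y_demo, h, s)
        for s in (0.5, 0.9, 1.5, 3.0) for h in (5, 10, 15)}
best_spread, best_hidden = min(grid, key=grid.get)
print(best_spread, best_hidden, round(grid[(best_spread, best_hidden)], 3))
```

In this sketch a network with as many hidden units as training points reproduces the training data exactly, which is why a cross-validated error rather than the training error is used to compare parameter settings.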

The non-linear regression method was tuned using the training objects and evaluated with the test samples.

Fig. 4 Optimization of number of PCs used in radial basis function

neural network

Fig. 5 Optimization of number

of neurons in hidden layer and

spread for RBF model

Table 5 Statistical parameters obtained for the PCR and PCA-RBFNN models

Parameter          PCR                        PCA-RBFNN
                   Training set   Test set    Training set   Test set
N                  28             7           28             7
R2                 0.604          0.161       0.956          0.999
RMSE               0.498          0.874       0.154          0.001
PRESS              5.956          2.294       0.665          5.39E-6
R2LOO              0.511          -           0.987          -
RMSELOO            0.526          -           0.082          -
PRESSLOO           7.483          -           0.354          -
R2LMO              0.487          -           0.966          -
RMSELMO            0.549          -           0.099          -
PRESSLMO           8.004          -           0.377          -
(R2 - R0^2)/R2     -0.655         -2.996      -0.045         -
(R2 - R'0^2)/R2    -0.651         -4.987      -0.045         -
k                  1.000          0.975       0.998          -
k'                 0.995          1.017       1.001          -
Rm2                0.2242         0.0492      0.756          -

N number of molecules in data set, R2 correlation coefficient of experimental and predicted activities, RMSE root mean square error, PRESS predicted error sum of squares, R2LOO correlation coefficient of leave-one-out cross validation, RMSELOO root mean square error of leave-one-out cross validation, PRESSLOO predictive residual sum of squares of leave-one-out cross validation, R2LMO correlation coefficient of leave-many-out cross validation, RMSELMO root mean square error of leave-many-out cross validation, PRESSLMO predictive residual sum of squares of leave-many-out cross validation

The predicted pIC50 values of the training and test data are listed in Table 4. Figure 3b depicts the plot of observed versus predicted values for the training and test sets. To compare the developed chemometric methods in predicting the activity of the studied molecules, statistics for both models are included in Table 5, including the R2 between the calculated and experimental values obtained using PCA-RBFNN for the training and test sets. Inspection of the RMSE and PRESS values for the developed regression methods reveals the superiority of the PCA-RBFNN method over PCR in predicting the inhibitory activity of the human glucagon receptor antagonists studied in this work. PCA-RBFNN also shows good agreement between predicted and experimental values in the various sets. As can be seen in Table 5, the model also passed the criteria for predictability recommended by Tropsha and Roy. These results confirm that the developed PCA-RBFNN model has the better statistical quality.
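The Rm2 row of Table 5 can be cross-checked against the R2 and (R2 - R0^2)/R2 rows. The sketch assumes Roy's definition, Rm2 = R2 (1 - sqrt|R2 - R0^2|), and assumes that the three tabulated values belong to the PCR training set, the PCR test set, and the PCA-RBFNN training set, in that order. Both assumptions are inferred from the numbers and are not stated in the text.

```python
import math


def rm_squared(r2, ratio):
    """Rm2 = R2 * (1 - sqrt(|R2 - R0^2|)), where ratio = (R2 - R0^2) / R2 as tabulated."""
    return r2 * (1.0 - math.sqrt(abs(ratio) * r2))


# (R2, (R2 - R0^2)/R2, tabulated Rm2) copied from Table 5.
rows = {
    "PCR training": (0.604, -0.655, 0.2242),
    "PCR test": (0.161, -2.996, 0.0492),
    "PCA-RBFNN training": (0.956, -0.045, 0.756),
}

recomputed = {name: rm_squared(r2, ratio) for name, (r2, ratio, _) in rows.items()}
for name, value in recomputed.items():
    print(f"{name}: recomputed Rm2 = {value:.4f}, tabulated = {rows[name][2]}")
```

Under these assumptions the recomputed values agree with the tabulated ones to within rounding, and only the PCA-RBFNN value exceeds 0.5, the threshold commonly applied to this metric.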

A very important step in QSAR model development is the definition of the applicability domain (AD) of the models, because a developed model is valid only within the domain for which it was developed. We used the standardized residuals of the activities calculated by the developed models, together with the leverage, to assign the AD. Leverage values can be calculated for both training and test compounds. Calculating the leverage of the training set is useful for identifying compounds that cause instability in the model. Calculating the leverage of objects that were not used in model building (such as the test set), on the other hand, is useful for assigning the applicability domain of the model.

The applicability domain of the PCA-RBFNN model is shown in Fig. 6. Response outliers are compounds whose standardized residuals are greater than two standard deviation units. Influential compounds are points with a leverage value higher than the warning leverage limit. As can be seen in this figure, all molecules in the training and test sets lie in the applicability domain of the developed model. Only two molecules (20 and 6) have a leverage value higher than the warning leverage limit (0.771), but their standardized residuals lie within ±2.0 SD units. Molecules 20 and 6 can therefore be considered influential in the fitting performance of the model, but there is no credible reason to treat them as outliers and delete them from the data set. On the other hand, as can be noted in Fig. 6, two molecules, 31 and 35, have standardized residuals higher than the cutoff level but leverages within the limit. As a result, none of the studied molecules is both a response outlier and a high-leverage compound.
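The quantities plotted in a Williams plot can be sketched as follows. The leverage of a compound with descriptor row x is taken as h = x (X'X)^-1 x', with X the training matrix, and the warning limit is written in the commonly used form 3(p + 1)/n. Both are assumptions of this illustration: the text does not give the formula behind the limit of 0.771, and that value is not derived here. The data are random placeholders.

```python
import numpy as np


def leverages(X_train, X_query):
    """h_i = x_i (X'X)^-1 x_i' for each row of X_query, with X'X from the training set."""
    core = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, core, X_query)


def warning_leverage(n_train, n_vars):
    """Commonly used warning limit h* = 3 (p + 1) / n (an assumed convention)."""
    return 3.0 * (n_vars + 1) / n_train


def standardized_residuals(y_obs, y_pred):
    """Residuals divided by their sample standard deviation."""
    residuals = np.asarray(y_obs, dtype=float) - np.asarray(y_pred, dtype=float)
    return residuals / residuals.std(ddof=1)


# Placeholder inputs: 28 training and 7 test "molecules" described by 8 "PCs".
rng = np.random.default_rng(0)
X_train = rng.normal(size=(28, 8))
X_test = rng.normal(size=(7, 8))

h_train = leverages(X_train, X_train)
h_test = leverages(X_train, X_test)
h_star = warning_leverage(n_train=28, n_vars=8)

# Compounds above the warning limit would be flagged as influential.
print(np.where(h_train > h_star)[0], np.where(h_test > h_star)[0])
```

A compound is then flagged as a response outlier when its standardized residual exceeds two units in absolute value, and as influential when its leverage exceeds the warning limit.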

Conclusion

In this study, the ability of DFT calculations to support the development of quantitative structure–activity relationships was assessed. B3LYP/6-311G was employed to calculate the molecular geometries and quantum descriptors of 35 2-pyridyl-3,5-diaryl pyrrole derivatives as human glucagon receptor antagonists.

Fig. 6 Williams plot of developed PCA-RBFNN model

PCR and PCA-RBFNN, as linear and non-linear regression methods, respectively, were investigated for building models. A comparison between the developed statistical methods revealed that PCA-RBFNN gave superior results: it could explain about 95 % of the variance in the inhibitory activity data, with a root mean square error of prediction of 0.154. On the basis of the results shown in Table 5, the non-linear method (PCA-RBFNN) was considerably better than the PCR method in goodness of fit, in the predictivity parameters, and in the other criteria used to evaluate the proposed model. These results for the non-linear model reflect a non-linear relationship between the principal components obtained from the quantum descriptors and the glucagon receptor inhibitory activity of the studied 2-pyridyl-3,5-diaryl pyrrole derivatives. External validation showed the predictive ability of the generated QSAR model. The predictive ability of the obtained QSAR models was also estimated according to the criteria of Tropsha et al. and Roy, and all the criteria were passed. The applicability domain of the model was defined by the leverage value. None of the studied compounds was outside the domain of the model.

Acknowledgments The authors gratefully acknowledge the Vice Chancellor for Research and Technology, Kermanshah University of Medical Sciences, for financial support. This article resulted from the Pharm. D thesis of Zohreh Nazari, major of Pharmacy, Kermanshah University of Medical Sciences, Kermanshah, Iran.

References

Agatonovic-Kustrin S, Beresford R (2000) Basic concepts of artificial

neural network (ANN) modeling and its application in pharma-

ceutical research. J Pharm Biomed Anal 22(5):717–727

Arkan E, Shahlaei M, Pourhossein A, Fakhri K, Fassihi A (2010)

Validated QSAR analysis of some diaryl substituted pyrazoles as

CCR2 inhibitors by various linear and nonlinear multivariate

chemometrics methods. Eur J Med Chem 45(8):3394–3406

Arulmozhiraja S, Morita M (2004) Structure-activity relationships for

the toxicity of polychlorinated dibenzofurans: approach through

density functional theory-based descriptors. Chem Res Toxicol

17(3):348–356

Becke AD (1993) Density-functional thermochemistry. III. The role

of exact exchange. J Chem Phys 98:5648

Breneman CM, Rhem M (1997) QSPR analysis of HPLC column

capacity factors for a set of high-energy materials using electronic

van der waals surface property descriptors computed by transfer-

able atom equivalent method. J Comput Chem 18(2):182–197

Cartier A, Rivail J-L (1987) Electronic descriptors in quantitative

structure–activity relationships. Chemometrics and Intelligent

Laboratory Systems 1(4):335–347

Clare BW (1995) Structure-activity correlations for psychotomimetics. III: Tryptamines. Aust J Chem 48(8):1385–1400

Clare BW, Supuran CT (1994) Carbonic anhydrase activators. 3:

structure-activity correlations for a series of isozyme II activa-

tors. J Pharm Sci 83(6):768–773

Clare BW, Supuran CT (1998) Semi-empirical atomic charges and dipole moments in hypervalent sulfonamide molecules: descriptors in QSAR studies. J Mol Struct (Theochem) 428(1):109–121

Cronce DT, Famini G, De Soto J, Wilson L (1998) Using theoretical

descriptors in quantitative structure–property relationships: some

distribution equilibria. J Chem Soc Perkin Trans 2(6):1293–1302

Fassihi A, Shahlaei M, Moeinifard B, Sabet R (2012) QSAR study of

anthranilic acid sulfonamides as methionine aminopeptidase-2

inhibitors. Monatsh Chem 143(2):189–198

Frisch M, Trucks G, Schlegel H, Scuseria G, Robb M, Cheeseman J, Montgomery J, Vreven T, Kudin K, Burant J (2008) Gaussian 03, revision C.02

Karelson M, Lobanov VS, Katritzky AR (1996) Quantum-chemical

descriptors in QSAR/QSPR studies. Chem Rev 96(3):1027–1044

Kennard R, Stone L (1969) Computer aided design of experiments.

Technometrics 11(1):137–148

Klopman G, Iroff LD (2004) Calculation of partition coefficients by

the charge density method. J Comput Chem 2(2):157–160

Kohn W, Sham LJ (1965) Self-consistent equations including exchange and correlation effects. Phys Rev 140:A1133–A1138

Lewis D, Lake B, Ioannides C, Parke D (1994) Inhibition of rat hepatic

aryl hydrocarbon hydroxylase activity by a series of 7-hydroxy

coumarins: QSAR studies. Xenobiotica 24(9):829–838

Mulliken R (1955a) Electronic population analysis on LCAO-MO

molecular wave functions. III. Effects of hybridization on overlap

and gross AO populations. J Chem Phys 23(12):2338–2342

Mulliken R (1955b) Electronic population analysis on LCAO-MO

molecular wave functions. IV. bonding and antibonding in

LCAO and valence-bond theories. J Chem Phys 23:2343

Mulliken R (1955c) Electronic population analysis on LCAO-MO molecular wave functions. II. Overlap populations, bond orders, and covalent bond energies. J Chem Phys 23:1841

Mulliken RS (1955d) Electronic Population Analysis on LCAO–MO

Molecular Wave Functions I. J Chem Phys 23(10):1833–1840

Ordorica M, Velazquez M, Ordorica J, Escobar J, Lehmann P (1993)

A principal component and cluster significance analysis of the

antiparasitic potency of praziquantel and some analogues. Quant

Struct Act Relat 12(3):246–250

Parr RG, Yang W (1989) Density-functional theory of atoms and

molecules, vol 16. Oxford University Press, Oxford

Pasha F, Srivastava H, Singh P (2005) Comparative QSAR study of

phenol derivatives with the help of density functional theory.

Bioorg Med Chem 13(24):6823–6829

Saghaie L, Shahlaei M, Madadkar-Sobhani A, Fassihi A (2010a)

Application of partial least squares and radial basis function

neural networks in multivariate imaging analysis-quantitative

structure activity relationship: study of cyclin dependent kinase 4

inhibitors. J Mol Graph Model 29(4):518–528

Saghaie L, Shahlaei M, Madadkar-Sobhani A, Fassihi A (2010b)

Application of partial least squares and radial basis function

neural networks in multivariate imaging analysis-quantitative

structure activity relationship: study of cyclin dependent kinase 4

inhibitors. J Mol Graph Model 29(4):518–528

Saghaie L, Shahlaei M, Fassihi A, Madadkar-Sobhani A, Gholivand

M, Pourhossein A (2011) QSAR analysis for some diaryl-

substituted pyrazoles as CCR2 inhibitors by GA-stepwise MLR.

Chem Biol Drug Des 77(1):75–85

Saghaie L, Sakhi H, Sabzyan H, Shahlaei M, Shamshirian D (2013)

Stepwise MLR and PCR QSAR study of the pharmaceutical

activities of antimalarial 3-hydroxypyridinone agents using

B3LYP/6-311++G** descriptors. Med Chem Res

22(4):1679–1688


Shahlaei M, Fassihi A (2013) QSAR analysis of some 1-(3,3-

diphenylpropyl)-piperidinyl amides and ureas as CCR5 inhibi-

tors using genetic algorithm-least square support vector machine.

Med Chem Res 22:4384–4400

Shahlaei M, Pourhossein A (2012) A 2D image-based method for

modeling some c-Src tyrosine kinase inhibitors. Med Chem Res

22:3012–3025

Shahlaei M, Pourhossein A (2013) Modeling of CCR5 antagonists as

anti HIV agents using combined genetic algorithm and adaptive

neuro-fuzzy inference system (GA–ANFIS). Med Chem Res

1–14

Shahlaei M, Fassihi A, Saghaie L (2010a) Application of PC-ANN

and PC-LS-SVM in QSAR of CCR1 antagonist compounds: a

comparative study. Eur J Med Chem 45(4):1572–1582

Shahlaei M, Sabet R, Ziari MB, Moeinifard B, Fassihi A, Karbakhsh

R (2010b) QSAR study of anthranilic acid sulfonamides as

inhibitors of methionine aminopeptidase-2 using LS-SVM and

GRNN based on principal components. Eur J Med Chem

45(10):4499–4508

Shahlaei M, Fassihi A, Saghaie L, Arkan E, Madadkar-Sobhani A,

Pourhossein A (2011a) Computational evaluation of some

indenopyrazole derivatives as anticancer compounds; application

of QSAR and docking methodologies. J Enzym Inhib Med Chem

28:16–32

Shahlaei M, Madadkar-Sobhani A, Fassihi A, Saghaie L, Shamshirian

D, Sakhi H (2011b) Comparative quantitative structure–activity

relationship study of some 1-aminocyclopentyl-3-carboxyamides

as CCR2 inhibitors using stepwise MLR, FA-MLR, and GA-

PLS. Med Chem Res 21:100–115

Shahlaei M, Madadkar-Sobhani A, Saghaie L, Fassihi A (2011c)

Application of an expert system based on Genetic Algorithm–

Adaptive Neuro-Fuzzy Inference System (GA–ANFIS) in QSAR

of cathepsin K inhibitors. Expert Sys Appl 39:6182–6191

Shahlaei M, Fassihi A, Pourhossein A, Arkan E (2013) Statistically

validated QSAR study of some antagonists of the human CCR5

receptor using least square support vector machine based on the

genetic algorithm and factor analysis. Med Chem Res 1–16

Sotomatsu T, Murata Y, Fujita T (1989) Correlation analysis of

substituent effects on the acidity of benzoic acids by the AM1

method. J Comput Chem 10(1):94–98

Tetko I, Luik A, Poda G (1993) Applications of neural networks in

structure-activity relationships of a small number of molecules.

J Med Chem 36(7):811–814

Trohalaki S, Gifford E, Pachter R (2000) Improved QSARs for

predictive toxicology of halogenated hydrocarbons. Comput

Chem 24(3):421–427

Tuppurainen K, Lotjonen S, Laatikainen R, Vartiainen T, Maran U,

Strandberg M, Tamm T (1991) About the mutagenicity of

chlorine-substituted furanones and halopropenals. A QSAR

study using molecular orbital indices. Mutat Res 247(1):97–102

Xiang Y, Liu M, Zhang X, Zhang R, Hu Z, Fan B, Doucet J, Panaye A

(2002) Quantitative prediction of liquid chromatography reten-

tion of N-benzylideneanilines based on quantum chemical

parameters and radial basis function neural network. J Chem

Inf Comput Sci 42(3):592–597

Yan X-F, Xiao H-M, Gong X-D, Ju X-H (2005) Quantitative

structure–activity relationships of nitroaromatics toxicity to the

algae (Scenedesmus obliguus). Chemosphere 59(4):467–471
