ORIGINAL RESEARCH
Computational neural network analysis of the affinity of 2-pyridyl-3,5-diaryl pyrroles analogs for the human glucagon receptor using density functional theory
Mohsen Shahlaei • Zohreh Nazari
Received: 4 June 2013 / Accepted: 19 September 2013 / Published online: 2 October 2013
© Springer Science+Business Media New York 2013
Abstract In our continuing efforts to provide a predictive
quantitative structure activity relationship using different
algorithms, radial basis function neural networks (RBFNN)
have been successfully combined with principal compo-
nent analysis (PCA) and trained to predict the biological
activity (pIC50) of 2-pyridyl-3,5-diaryl pyrrole derivatives
as human glucagon receptor antagonists. A set of quantum
descriptors, including the HOMO and LUMO energies,
softness, and hardness, was calculated using the
DFT-B3LYP method with the 6-311G basis set. An
ANN with 1-15-1 architecture was generated using eight
principal components. A principal component regression
(PCR) model was also developed for comparison. It was
found that a properly selected and trained RBFNN with a
suitable training set could represent the dependence of the
biological activity on the principal components that were
calculated using quantum descriptors fairly well. For
evaluation of the predictive ability of the developed PCA-
based RBFNN model, an optimized network was applied to
predict the pIC50s of compounds in the test set, which were
not used in the modeling phase of the procedure. For the test
set, the PCR model gave a squared correlation coefficient (R2)
and a root mean square error of 0.161 and 0.874, respectively,
compared with 0.999 and 0.0154 for the principal component
based RBFNN model. These improvements are due to the fact
that the pIC50s of 2-pyridyl-3,5-diaryl pyrrole derivatives
show non-linear correlations with the principal components
extracted from the quantum descriptors.
Keywords Glucagon receptor inhibition activity · 2-Pyridyl-3,5-diaryl pyrrole derivatives · Radial basis function neural network · Density functional theory
Introduction
The prediction of pharmaceutical and biological activities,
physicochemical and pharmacokinetic properties/activities
of small molecules is the main goal of quantitative struc-
ture–property/activity relationships (QSPRs/QSARs)
(Cronce et al., 1998; Arkan et al., 2010; Saghaie et al.,
2010a, 2011; Shahlaei and Fassihi, 2013; Shahlaei et al.,
2010a, 2011a, c; Shahlaei and Pourhossein, 2012, 2013;
Shahlaei et al., 2010b; Shahlaie et al., 2013). A typical
QSAR model is developed on the basis of the correlation
between the experimental biological activity and structural
descriptors reflecting the molecular structure of the compounds
of interest. Since these structural descriptors are
determined solely by computational techniques, a priori
prediction of the activities of compounds is feasible without
laboratory measurements. This saves time, chemicals, space,
and instrumentation, and alleviates safety (toxicity) and
disposal concerns. For many years, QSAR models
have been efficiently employed for the study of biological
mechanisms of various reactive compounds (Arkan et al.,
2010; Saghaie et al., 2010b, 2011, 2013; Shahlaei and
Fassihi, 2013; Shahlaei et al., 2010a, 2011a, b, c; Shahlaei
and Pourhossein, 2012, 2013; Shahlaei et al., 2010b;
Shahlaie et al., 2013). To obtain a significant correlation
M. Shahlaei (✉)
Novel Drug Delivery Research Center, School of Pharmacy,
Kermanshah University of Medical Sciences,
67346-67149 Kermanshah, Iran
e-mail: [email protected]; [email protected]
Z. Nazari
Student Research Committee, Kermanshah University of
Medical Sciences, Kermanshah, Iran
Med Chem Res (2014) 23:2046–2061
DOI 10.1007/s00044-013-0801-3
MEDICINAL CHEMISTRY RESEARCH
between independent variables (descriptors) and a dependent
variable (biological activity), it is crucial that suitable
descriptors are used (Karelson et al., 1996), because the
success of a QSAR model is highly dependent on the
selection of significant descriptors. In recent years, a
number of quantum-chemical descriptors such as charges,
orbital energies, frontier orbital densities, and dipole
moment, etc. estimated from density functional theory
(DFT) calculations have been employed in a successful
manner in developing different QSAR models for pre-
dicting biological activity in terms of the structures and
physicochemical properties of compounds (Pasha et al.,
2005; Saghaie et al., 2013).
Quantum-chemical descriptors, which can be obtained by
calculation, can describe defined molecular activities and
are not restricted to structurally closely related molecules.
Therefore, in recent years, the development of QSAR
models based on quantum-chemical descriptors has gained
significant interest (Karelson et al., 1996). Recently, it has
been reported that some comparative QSAR models, using
the descriptors calculated using the DFT approach instead of
the semi-empirical techniques AM1 or PM3, can improve
the accuracy of the results and lead to more reliable QSAR
models. A QSAR work by Arulmozhiraja and Morita (2004)
which studied relationships between the various DFT-based
descriptors (chemical softness, electronegativity, and elec-
trophilicity index) and the toxicity of 33 polychlorinated
dibenzofurans (PCDFs) showed a moderate to satisfactory
success for the DFT-based reactivity descriptors in the
toxicological QSARs. Pasha et al. (2005) studied quantum-
chemical descriptors based QSAR models on toxicity of
phenol derivatives with AM1, PM3, PM5, and DFT meth-
ods, indicating that the DFT method is more reliable than
others and has an improved predictive ability.
Various statistical and mathematical approaches for
building QSAR models have been applied including mul-
tiple linear regression (MLR), principal component ana-
lysis (PCA), and partial least-squares regression (PLS)
(Fassihi et al., 2012; Shahlaei et al., 2010a, 2011b). In
addition, artificial neural networks (ANNs) have become
well-known and popular due to their ability to model complex
non-linear relationships between dependent and
independent variables (Arkan et al., 2010; Saghaie et al.,
2010a; Shahlaei et al., 2010a; Shahlaei and Pourhossein,
2012; Shahlaei et al., 2010b). ANNs are biologically
inspired computational algorithms designed to simulate the
way in which the human brain processes input data and
extracts valuable information. ANNs do not require an
explicit formulation of the mathematical or physical
relationships between the dependent and independent variables
of the problem at hand. This gives ANNs an advantage over
other mathematical and statistical regression approaches
for some chemical applications. For these reasons,
various ANN algorithms have been applied in recent years
to a wide variety of chemical problems. Several types of
ANN algorithms have been developed to date, and new ones
continue to be introduced
(Agatonovic-Kustrin and Beresford, 2000). The behavior
and response of a typical ANN algorithm are
determined by the transfer functions of its neurons, by the
learning rules, and by the architecture itself. A typical ANN
comprises artificial neurons, or processing elements, connected
by weights; these constitute the network structure
and are organized in layers.
The extensive applications of ANNs in science stem
from their flexibility and power to model non-linear sys-
tems without prior knowledge of an empirical model.
The main goal of the current study is to develop a QSAR
model based on quantum descriptors using radial basis
function ANNs for modeling and predicting the human glucagon
receptor antagonist activity values of 2-pyridyl-3,5-diaryl
pyrrole derivatives. In the first step, a PCR
model was constructed. Then, to inspect the non-linear
relation between the activity and the principal components
calculated from the quantum descriptors, an ANN model was
developed for predicting the pIC50 values, and the results were
compared with the experimental values and with the values
calculated using the PCR model.
Methods
Calculation of quantum descriptors
The basic skeletons of the studied compounds and their
substituent details are summarized in Table 1. The initial
structures of all 2-pyridyl-3,5-diaryl pyrrole derivatives were
constructed with the CS Chem3D software (Ultra 10.0, ChemOffice
2006, CambridgeSoft Corporation).
To save computational time, initial geometry optimizations
were carried out with the molecular mechanics (MM)
method using the MM+ force field.
Various computational studies (Yan et al., 2005; Trohalaki
et al., 2000; Saghaie et al., 2013) indicate that DFT
(Kohn and Sham, 1965; Parr and Yang, 1989) at the B3LYP
level of theory (Becke, 1993) is suitable for
QSAR studies. In this study, the 2-pyridyl-3,5-diaryl pyrrole
compounds were fully optimized using the B3LYP
method and the 6-311G basis set. Frequency calculations show
that all molecular structures are stable and correspond to
minima on the potential energy surface. Quantum-chemical
calculations were carried out with the Gaussian 03
(Frisch et al., 2008) program on a Linux platform.
As listed in Table 2, 18 quantum-chemical descriptors
were calculated and examined for their variation and their
usefulness in predicting the inhibition activity of the
compounds of interest.
Table 1 The general molecular structure and substituent details of compounds used in this study
Y
X
F
N
Ra
RaYXCompd
S(O)Me
HN
N
NHN1
SMeHN
N
NHN2
S(O)MeHN
CH
NHCH3
S(O)Me
HC
NH
CHNH4
SMe
HN
CH
NHCH5
SMe
HC
NH
CHNH6
Table 1 continued
NH
R1R3
R2
R3R2R1Compd
NFH7
NFCH38
NF9
NF10
NFCl11
NF
Cl
12
NFCl
Cl
13
NFF14
NFOCH315
NFCH316
Table 1 continued
NFCH2CH317
NFNO218
NFCO2CH2CH319
NFCN20
NFNH221
NFCOOH22
NCl23
NClCl24
NBrCl25
N
Br
Cl26
N
Br
Cl27
NOCH3Cl28
The quantum descriptors employed in this work, such as
polarizability (α), dipole moment (μ), energy of the highest
occupied molecular orbital (EHOMO), energy of the lowest
unoccupied molecular orbital (ELUMO), most negative
atomic charge, most positive charge, etc., have been
obtained directly or indirectly (Table 2) from the Gaussian
output files.
Selecting the input variables of an ANN is necessary to
avoid overfitting (Tetko et al., 1993) when
many input descriptors are available. As a linear technique for
dimensionality reduction, PCA transforms the input
data set from its original form (points in m-dimensional
space) to a new form (points in p-dimensional space),
where p is less than m. During the process, most of the
variability of the original input data set is
retained. With the input data set represented in a lower
dimension, a smaller ANN can be used for
prediction.
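The reduction from m descriptors to p principal components described above can be sketched as follows. This is a minimal illustration only, assuming NumPy and autoscaled descriptors; the helper name `pca_scores` and the random stand-in data are hypothetical and are not taken from the authors' MATLAB scripts.

```python
import numpy as np

def pca_scores(X, p):
    """Project an (n_samples x m) descriptor matrix onto its first p principal components."""
    X = np.asarray(X, dtype=float)
    # Autoscale each descriptor to zero mean and unit variance (assumed preprocessing).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the scaled data; the rows of Vt are the loading vectors.
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :p] * s[:p]            # coordinates in the p-dimensional space
    explained = s**2 / np.sum(s**2)      # fraction of variance carried by each component
    return scores, explained[:p]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(35, 18))       # synthetic stand-in: 35 molecules x 18 descriptors
scores_demo, explained_demo = pca_scores(X_demo, p=8)
```

The score columns are mutually orthogonal, which is what allows a subset of them to be passed to a regression model or a small network without collinearity problems.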
Principal component regression (PCR)
Next, a PCA was performed for variable reduction and data
interpretation. In PCA, descriptors that describe the same
property cluster together, so the
predicted activity can be described with a smaller number of
independent variables.
Table 1 continued
N
OCH3
Cl29
N
H3CO
Cl30
N
H3CH2CO
Cl31
N
H3CH2CH2CO
Cl32
N
O
Cl33
N
OH2CCHH3C
H3C
Cl34
N
OH3CH2CH2C
Br
Cl35
PCR is a standard technique among the multivariate
regression methods available for QSAR studies. A typical
PCR model can be explained starting from the
following equation:
$$y = Xb + e \qquad (1)$$
This equation describes the relationship between a set of
descriptors X (regressors) and the pIC50s y (regressands) by
means of a vector b. Note that the vectors y and b are
considered to be column vectors. If k denotes the number
of molecules used in the regression, p the number of
descriptors calculated for each molecule, and y the
biological activity to be regressed, then y is a
vector of length k and X a (k × p) matrix containing the
calculated descriptors as row vectors. The regression vector
b, of length p, has to be determined
in the regression step. In this step, the Euclidean length of
the error vector e is minimized by
solving a least-squares problem. The idea of PCR is to
decompose X into a matrix R containing
k eigenvectors (factors) of length p and an orthogonal
(k × k) matrix C containing the scores as rows.
$$X = CR \qquad (2)$$
The matrix C and the eigenvalues [λ] are given by
solving the eigenvalue problem
$$C^{T} Z C = [\lambda] \qquad (3)$$
and the eigenvector matrix R (k × p) is calculated by
$$R = C^{T} X. \qquad (4)$$
In Eq. (3), $Z = XX^{T}$ denotes the (k × k) covariance
matrix and [λ] is the diagonal matrix of eigenvalues. The
rows in $C^{T}$ are the eigenvectors of Z and its columns are the
"scores." The column vectors of the square matrix C are
orthonormal and often called principal components (PCs).
The scores with respect to the factors are the new
orthogonal variables on which the properties are now
regressed instead of the original variables (here, the
calculated quantum descriptors). Due to their orthogonality, those PCs
which are not assumed to be significant can be omitted, i.e.,
the calculation is usually performed with reduced matrices
($C_r$ and $R_r$, respectively) for the calibration. Here r denotes the number
of factors included in the model. The regression vector
with respect to the original variables (theoretical molecular
descriptors) can be obtained as follows:
$$\hat{b} = R_r^{+} C_r^{T} y,$$
where $R_r^{+}$ denotes the pseudo inverse of $R_r$.
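As an illustration of Eqs. (1) to (4) and of the expression for the regression vector, the following sketch computes b from the r leading eigenvectors of Z. It assumes NumPy and mean-centred descriptors; the function name `pcr_fit` and the synthetic data are hypothetical and only serve to show the algebra.

```python
import numpy as np

def pcr_fit(X, y, r):
    """Return the regression vector b = pinv(R_r) @ C_r.T @ y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = X @ X.T                           # (k x k) matrix Z = X X^T
    eigvals, C = np.linalg.eigh(Z)        # columns of C are orthonormal eigenvectors of Z
    order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue
    C_r = C[:, order[:r]]                 # keep the r leading components
    R_r = C_r.T @ X                       # (r x p) reduced factor matrix
    return np.linalg.pinv(R_r) @ C_r.T @ y

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 5))
X_demo -= X_demo.mean(axis=0)             # mean-centre the descriptors (assumed preprocessing)
b_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ b_true                  # noise-free synthetic response
b_full = pcr_fit(X_demo, y_demo, r=5)     # all components retained
b_reduced = pcr_fit(X_demo, y_demo, r=2)  # truncated model with two components
```

With all five components retained, the sketch recovers the coefficients used to generate the synthetic response; dropping components trades fit for a smaller, better-conditioned model.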
The greatest amount of variability of the original
quantum-electronic descriptor data set is represented by the
first principal component (PC), and the second principal
component explains the maximum variance of the residual
data set. The third one then describes the most
important variability of the next residual data set, and so
on. According to the theory of least squares, the eigenvectors
of all principal components are orthogonal to each
other in the multidimensional data space defined by the
quantum-electronic descriptors. Generally speaking, only p principal
components are enough to account for most of the variance in
an m-dimensional data set, where p is the number of
important principal components of the data set and m
is the number of all the principal components in the
data set. Since p is less than m, PCA is
generally regarded as a data reduction technique. That is to
say, PCA can project a multi-dimensional data set to a
lower-dimensional data space without losing most of the
information of the original data set.
Splitting PCs matrix into training and test sets
At the next step, in order
to develop a reliable (validated) QSAR model, consecutive
molecules are selected and placed alternately in the training
and test sets. The division of an original dataset into
training and test sets can be carried out using various
algorithms. One of the most popular and successful
is the Kennard–Stone algorithm.
Table 2 Quantum-electronic descriptors used in this study
Descriptor abbreviation    Descriptor definition
EHOMO                      The energy of the highest occupied molecular orbital
ELUMO                      The energy of the lowest unoccupied molecular orbital
H–L                        The HOMO–LUMO energy gap
Electronegativity          $\chi = (E_{HOMO} - E_{LUMO})/2$
Hardness                   $\eta = -(E_{HOMO} - E_{LUMO})/2$
Electrophilicity           $\omega = \chi^{2}/(2\eta)$
MPC                        The most positive charge
LNC                        The least negative charge
SSC                        The sum of square of charges
SSPC                       The sum of square of positive charges
SSNC                       The sum of square of negative charges
SPC                        The sum of positive charges
SNC                        The sum of negative charges
SAC                        The sum of absolute charge
DMx                        Dipole moment in x direction
DMy                        Dipole moment in y direction
DMz                        Dipole moment in z direction
TDM                        Total dipole moment
The Kennard–Stone method selects the molecules which
are furthest from each other in the dataset one by one (Kennard
and Stone, 1969). The quantity employed to measure the
distance is the Euclidean distance. For a response matrix
with N molecules (rows) and K PCs (columns), the multi-
variate Euclidean distance between samples i and j is
$$D_{ij} = \|x_i - x_j\| = \sqrt{\sum_{v=1}^{K} (x_{iv} - x_{jv})^{2}}$$
The first step is to select the two furthest molecules
(maximum Dij). The third molecule is picked out by
performing the following steps: the distance between each
molecule and the two furthest molecules are calculated; the
shortest of each of these pairs of distances is selected and
the molecule with the maximum value in this set of
minimal distances is chosen.
Generalizing, if M molecules from the original dataset
with N molecules have been selected, the next molecule
M + 1 is chosen by calculating
$$d_i(M) = \min\{D_{i1}, D_{i2}, \ldots, D_{iM}\}$$
for the N − M molecules that have not been picked out
previously. Of these, the one which complies with the
following equation is selected:
$$d_i(M + 1) = \max\{d_i(M)\}$$
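The max-min selection rule described above can be sketched in plain Python. This is an illustrative implementation under the stated definitions (Euclidean distance in PC space); the function name `kennard_stone` and the five two-dimensional points are made up for the example.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone max-min rule."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Step 1: start with the two points that are furthest apart (maximum D_ij).
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda pair: dist[pair[0]][pair[1]])
    selected = [i0, j0]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select and remaining:
        # d_i(M): distance from candidate i to its nearest already-selected point.
        nearest = {i: min(dist[i][s] for s in selected) for i in remaining}
        # The next point is the candidate with the largest nearest-selected distance.
        best = max(remaining, key=lambda i: nearest[i])
        selected.append(best)
        remaining.remove(best)
    return selected

pcs_demo = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (0.0, 1.0)]
train_idx = kennard_stone(pcs_demo, n_select=3)
```

In a QSAR setting the selected indices would typically form the training set and the unselected molecules the test set, so that the training set spans the descriptor space.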
Radial basis function neural networks (RBFNNs)
In the present study, one type of neural network, namely the
RBFNN, was employed to establish an alternative non-linear
model. The theory of RBFNN has been
described in detail elsewhere (Xiang et al., 2002), so we
limit ourselves to a brief outline highlighting only the
most important aspects.
Usually, RBFNN comprises three layers, i.e., the input
layer, the hidden layer, and the output layer (Fig. 1). The
input layer does not process the information, since it only
distributes the input vectors to the hidden layer, whereas
the latter consists of a number of RBF units (nh) and biases
(bk). Each neuron on the hidden layer employs a radial
basis function as a non-linear transformation function to
operate on the input data. The frequently used RBF is a
Gaussian function that is characterized by a center and a
width. This function measures the Euclidean distance
between the input vector (X) and the radial basis function
center (cj) and performs the non-linear transformation
within the hidden layer as follows:
$$h_j = \exp\left(-\|X - c_j\|^{2} / r_j^{2}\right),$$
where $h_j$ denotes the output of the jth RBF unit, while $c_j$
and $r_j$ are the center and width of that unit, respectively.
The operation of the output layer is linear and is given by
$$y_k(X) = \sum_{j=1}^{n_h} w_{kj} h_j(X) + b_k,$$
where $y_k$ is the kth output unit for the input vector X, $w_{kj}$ is
the weight connection between the kth output unit and the
jth hidden layer unit, and $b_k$ is the respective bias.
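The two layer equations can be checked with a small forward-pass sketch for a single output unit. The centres, weights, width, and bias below are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import math

def rbf_hidden(x, centers, width):
    """Gaussian hidden-layer outputs h_j = exp(-||x - c_j||^2 / r^2) for one input vector."""
    return [math.exp(-math.dist(x, c) ** 2 / width ** 2) for c in centers]

def rbf_output(x, centers, width, weights, bias):
    """Single linear output unit: y = sum_j w_j * h_j(x) + b."""
    h = rbf_hidden(x, centers, width)
    return sum(w * hj for w, hj in zip(weights, h)) + bias

centers_demo = [(0.0, 0.0), (1.0, 1.0)]   # hypothetical RBF centres
y_at_center = rbf_output((0.0, 0.0), centers_demo, width=1.0,
                         weights=[2.0, 0.0], bias=0.5)
```

An input that coincides with a centre gives that unit an activation of exactly 1, and the activation decays with distance at a rate set by the width.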
From the two equations above, one can see that the design of an
RBFNN involves selecting the centers, the number of hidden layer
units, the widths, and the weights. There are various ways of
selecting the centers, such as random subset selection,
k-means clustering, and RBF–PLS. In this study, a forward
subset selection routine was used to select the centers from
the training set samples. The widths of the
radial basis functions can either be chosen equal for all
the units or different for each unit. Here, we limited ourselves
to Gaussian functions with a constant width for all the
units. Furthermore, the connection weights
between the hidden layer and the output layer were adjusted
using a least-squares solution after the selection of the RBF
centers and width. All RBFNN calculations were performed
using home-developed scripts in the MATLAB package
(www.mathworks.com/products/matlab/).
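The least-squares step for the output weights can be sketched as below. This is not the authors' MATLAB code: it assumes NumPy, a constant width, and, as a deliberate simplification, takes the first training samples as centres instead of running a forward subset selection. All names and data are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, width):
    """Matrix of Gaussian activations with a trailing column of ones for the bias."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / width ** 2)
    return np.hstack([H, np.ones((X.shape[0], 1))])

def rbf_fit(X, y, centers, width):
    """Solve for the output weights and bias in the least-squares sense."""
    coef, *_ = np.linalg.lstsq(rbf_design(X, centers, width), y, rcond=None)
    return coef

def rbf_predict(X, centers, width, coef):
    """Evaluate the fitted network on new input rows."""
    return rbf_design(X, centers, width) @ coef

rng = np.random.default_rng(2)
X_train = rng.uniform(-1, 1, size=(30, 2))
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1] ** 2   # synthetic non-linear target
centers = X_train[:15]                                      # simplification: first 15 samples as centres
coef = rbf_fit(X_train, y_train, centers, width=0.8)
rmse_train = np.sqrt(np.mean((rbf_predict(X_train, centers, 0.8, coef) - y_train) ** 2))
```

Because the centres and width are fixed before the weights are solved, the only trainable part is linear, which is why a single least-squares solve is sufficient.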
Moreover, the overall performance of the final RBFNN
model was evaluated in terms of its root mean squared
error (RMSE), and its goodness and robustness estimated
by the same statistical parameters as those used for the
linear model.
Validation and evaluation
Testing the stability, predictive power, and generalization
ability of the models is a very important step in QSAR
Fig. 1 The typical architecture of the RBF-ANN
study. As for the validation of predictive power of a QSAR
model, two basic principles (internal validation and exter-
nal validation) are available.
In both validation methods, R2, which represents the
explained variance for a given set, was used to assess the
goodness of fit of the model. In addition, the prediction
performance of the built models must be estimated
in order to build a successful QSAR model. In this study,
the prediction performance of the developed models was
evaluated using two parameters, the RMSE and the percent
relative standard error [RSEP (%)].
Cross validation is one of the most popular methods
for internal validation. In this study, the internal predictive
capability of the model was evaluated by leave-one-out
cross validation (Q2LOO). A good Q2LOO often indicates
good robustness and high internal predictive power of a
QSAR model. However, recent studies of Tropsha and
co-workers [16] indicate that there is no evident correlation
between the value of Q2LOO and the actual predictive power
of a QSAR model, revealing that Q2LOO alone is
inadequate for a reliable estimate of a model's predictive
ability for all new compounds. In order to determine both
the generalizability of QSAR models for new compounds
and the true predictive ability of the models, statistical
external validation can be used at the model development
step by properly employing a prediction set for validation.
The results of data splitting using the Kennard–Stone algorithm
are shown in Table 1, where the test set compounds are marked
with an asterisk.
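A leave-one-out calculation of the kind described above can be sketched for a linear model as follows. The sketch assumes NumPy and the common definition Q2 = 1 − PRESS/SS_total about the full-set mean; the function name `q2_loo` and the synthetic data are illustrative only.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q^2 for a linear model with intercept: 1 - PRESS / SS_total."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.hstack([np.ones((len(y), 1)), X])   # add an intercept column
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # leave molecule i out
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2     # squared error on the held-out molecule
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(28, 3))
y_demo = 6.0 + X_demo @ np.array([1.5, -0.4, 0.9]) + rng.normal(scale=0.1, size=28)
q2_demo = q2_loo(X_demo, y_demo)
```

Each molecule is predicted by a model that never saw it, so Q2LOO can be much lower than the fitted R2 when the model is overfitted.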
Tropsha also suggested a set of criteria; if these
criteria are met, the model can be considered
predictive [43]. These criteria include:
$$R^{2}_{LOO} > 0.5$$
$$R^{2} > 0.6$$
$$\frac{R^{2} - R_0^{2}}{R^{2}} < 0.1$$
$$\frac{R^{2} - R_0'^{2}}{R^{2}} < 0.1$$
$$0.85 < k < 1.15 \quad \text{or} \quad 0.85 < k' < 1.15$$
$R^{2}$ is the correlation coefficient of the regression between the
predicted and observed activities of compounds in the training
and test sets. $R_0^{2}$ is the correlation coefficient for the
regression of predicted versus observed activities through
the origin, $R_0'^{2}$ is the correlation coefficient for the regression
of observed versus predicted activities through the
origin, and the slopes of the regression lines through the
origin are denoted by k and k', respectively. Details of the
definitions of parameters such as $R_0^{2}$, $R_0'^{2}$, k, and k' are
given in the literature and are not repeated
here for brevity [43].
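The through-origin statistics can be computed as in the sketch below. It follows the wording used here (predicted versus observed for $R_0^{2}$ and k, observed versus predicted for $R_0'^{2}$ and k'); the exact conventions differ between publications, so this pairing is an assumption and [43] should be consulted for the authoritative definitions. The cross-validated criterion is omitted because it requires refitting the model. All numbers are made up.

```python
def through_origin(x, y):
    """Slope and R0^2 of the regression of y on x forced through the origin."""
    k = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    y_mean = sum(y) / len(y)
    ss_res = sum((b - k * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - y_mean) ** 2 for b in y)
    return k, 1.0 - ss_res / ss_tot

def pearson_r2(x, y):
    """Squared Pearson correlation coefficient between x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def tropsha_checks(observed, predicted):
    """Evaluate the external-validation criteria listed above (except the LOO one)."""
    r2 = pearson_r2(observed, predicted)
    k, r0_sq = through_origin(observed, predicted)       # predicted versus observed
    k_p, r0p_sq = through_origin(predicted, observed)    # observed versus predicted
    return {
        "R2 > 0.6": r2 > 0.6,
        "(R2 - R0^2)/R2 < 0.1": (r2 - r0_sq) / r2 < 0.1,
        "(R2 - R0'^2)/R2 < 0.1": (r2 - r0p_sq) / r2 < 0.1,
        "0.85 < k < 1.15 or 0.85 < k' < 1.15": 0.85 < k < 1.15 or 0.85 < k_p < 1.15,
    }

obs_demo = [5.2, 6.1, 6.8, 7.4, 5.9, 6.5]     # made-up observed pIC50 values
pred_demo = [5.3, 6.0, 6.9, 7.3, 6.0, 6.4]    # made-up predicted pIC50 values
checks_demo = tropsha_checks(obs_demo, pred_demo)
```

Returning the individual checks as a dictionary makes it easy to see which criterion a rejected model fails.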
In addition, according to Roy and Roy [44], the
difference between the values of $R_0^{2}$ and $R_0'^{2}$ must be studied
and given importance. They suggested the following modified
$R^{2}$ form:
$$R_m^{2} = R^{2}\left(1 - \sqrt{\left|R^{2} - R_0^{2}\right|}\right)$$
An $R_m^{2}$ value greater than 0.5 for a given model indicates good
external predictability of the developed model.
The actual predictability of each model developed on
the training set is confirmed on an external test set [43] and
is calculated from $R_p^{2} = 1 - \mathrm{PRESS}/\mathrm{SD}$, where PRESS is
the sum of squared differences between the measured
activity and the predicted value for each compound in the
test set, and SD is the sum of squared deviations between
the measured activity for each molecule in the test set and
the mean measured value of the training set.
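The modified $R^{2}$ of Roy and Roy and the external predictivity $R_p^{2}$ are simple enough to compute directly. The sketch below uses the definitions given in the text; the function names and the input numbers are hypothetical.

```python
import math

def r_m_squared(r2, r0_sq):
    """Modified r^2 of Roy and Roy: r_m^2 = R^2 * (1 - sqrt(|R^2 - R0^2|))."""
    return r2 * (1.0 - math.sqrt(abs(r2 - r0_sq)))

def r2_pred(y_test, y_test_pred, y_train_mean):
    """External predictivity: 1 - PRESS / SD, with SD taken about the training-set mean."""
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_test_pred))
    sd = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - press / sd

rm2_demo = r_m_squared(0.90, 0.89)            # made-up statistics
r2p_demo = r2_pred([6.0, 7.0, 8.0], [6.1, 6.9, 8.2], y_train_mean=6.5)
```

Note that SD is measured about the training-set mean, not the test-set mean, so $R_p^{2}$ rewards a model only for doing better than predicting the training average for every test compound.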
Developed models are also tested for reliability and
robustness by Y-randomization testing: new models are
recalculated for a randomly reordered response. Evidence
that the proposed models are well founded, and
not just the result of chance correlation, is provided when the
models built on the randomized response have significantly lower R2
than the original models. If the randomized models show high R2, it
implies that an acceptable QSAR model cannot be
obtained.
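A Y-randomization test can be sketched as follows for a linear model. The sketch assumes NumPy and ordinary least squares as the modelling method; the number of rounds, the seed, and the synthetic data are arbitrary choices made for illustration.

```python
import numpy as np

def r2_linear(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    A = np.hstack([np.ones((len(y), 1)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_rounds=100, seed=0):
    """R^2 values obtained after randomly reordering the response vector."""
    rng = np.random.default_rng(seed)
    return np.array([r2_linear(X, rng.permutation(y)) for _ in range(n_rounds)])

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(28, 3))
y_demo = X_demo @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=28)
r2_original = r2_linear(X_demo, y_demo)
r2_scrambled = y_randomization(X_demo, y_demo)
```

A large gap between the original R2 and the distribution of scrambled R2 values supports the claim that the model is not a chance correlation.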
Applicability domain of the model
The presence of outliers was checked using the Williams
plot. This plot shows response outliers, i.e., molecules
with standardized residuals greater than two standard
deviation units, and compounds that are structurally influential in
determining the figures of merit and statistical parameters of
the developed model, i.e., molecules with a leverage
value (h) greater than the warning leverage limit h* = 3k'/n, where
k' is the number of model variables plus one and n is the
number of molecules used in model development
[22].
In addition, the applicability domain of a QSAR model
must be defined, and only predictions for
compounds that fall into this domain may be considered
reliable [21]. Such QSAR models can be used for
screening new compounds. The Williams plot allows
an immediate and simple graphical detection of both the
response outliers and the structurally influential compounds in
a model, i.e., those with h > h*. Compounds with h > h* strongly
affect the goodness of fit of the developed model, but these
compounds may not be outliers because of their low
residuals. It must be noted that compounds with a high
leverage and good fitting in the developed model can
stabilize the model. On the other hand, compounds with
bad fitting in the developed model may be outliers. Thus, a
combination of the leverage and the standardized residuals
can be used to define the applicability domain.
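The leverage values and the warning limit used in a Williams plot can be computed as in the sketch below. It assumes NumPy and the usual hat-matrix definition of leverage, $h_i = x_i^{T}(X^{T}X)^{-1}x_i$, which is not spelled out in the text; whether an intercept column is included in X is a modelling choice and is left out here. The data are random stand-ins.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (X^T X)^-1 x_i of query rows with respect to the training matrix."""
    XtX_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, XtX_inv, X_query)

def warning_leverage(n_variables, n_train):
    """h* = 3k'/n with k' = number of model variables + 1."""
    return 3.0 * (n_variables + 1) / n_train

rng = np.random.default_rng(5)
X_train = rng.normal(size=(28, 3))            # stand-in for 28 training molecules, 3 model variables
h_train = leverages(X_train, X_train)
h_star = warning_leverage(n_variables=3, n_train=28)
outside_domain = h_train > h_star             # structurally influential training compounds
```

Plotting the standardized residuals against these leverage values, with a vertical line at h* and horizontal lines at the residual cut-off, gives the Williams plot described above.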
Results and discussion
Interpretation of quantum descriptors
Many molecular properties depend on intermolecular
interactions. The main component of these interactions is
electrostatic in its nature. Electrical charges in the molecule
are simply the driving force of electrostatic interactions.
Charge-based descriptors have therefore been widely used
as chemical reactivity indices or as measures of weak
intermolecular interactions. The charge distributions in a
given molecule and partial charges on the atoms can be
estimated using quantum-chemical calculations. One of the
most important parts of standard output of almost any
quantum calculation is the Mulliken atomic charges
(Mulliken, 1955a, b, c, d). Usually, the minimum (most
negative) and maximum (most positive) atomic partial
charges in the molecule or the minimum or maximum
partial charges for particular types of atoms are employed
as quantum descriptors (Clare and Supuran, 1994; Cartier
and Rivail, 1987). Different sums of absolute or square
values of partial charges (for example sum of positive
charges and sum of square of positive charges) have also
been employed to explain intermolecular interactions.
Other usual charge-based quantum-chemical indices used
as electrostatic descriptors in QSAR models are the aver-
age absolute atomic charge (Clare and Supuran, 1994;
Ordorica et al., 1993) and a polarity parameter defined as
the difference between the values of the most positive and
negative charge (Clare and Supuran, 1994; Cartier and
Rivail, 1987; Clare, 1995).
Electrostatic interactions can also be denoted by the
respective electrical moments and their components. The
polarity is denoted by the dipole moment (μ). The polarization
of a molecule by an external electric field can be
defined in terms of nth order susceptibility tensors of the
molecule (Sotomatsu et al., 1989). The first order term that
is referred to as the polarizability of the molecule repre-
sents the relative susceptibility of the electron cloud of an
atom or a molecule to be distorted from its normal shape by
the presence of an external field.
Because of this distortion, an induced electric dipole
moment appears. Polarizability (α) is a tensor relating the
induced dipole moment (μ_ind) to the applied electric field
strength. The non-diagonal elements of the tensor represent
the polarizability of the electrons along one of the axes of
the coordinate system due to a component of the applied
electric field along another of the coordinate axes. As this
effect is insignificant compared to the polarizability in the
direction of the applied electric field, the non-diagonal
elements of the polarizability tensor are zero or very small
compared to the diagonal elements. The polarizability is
therefore represented in practice as ‘‘mean polarizability,’’
i.e., the average polarizability over the three axes of the
molecule, and equals one-third of the trace. It has been
shown that α is related to the molecular volume (Lewis
et al., 1994), hydrophobicity (Breneman and Rhem, 1997),
and the electrophilic superdelocalizability (Clare and
Supuran, 1998).
According to classical chemical theory, all chemical
interactions are by nature either electrostatic (polar) or
orbital (covalent) driven. In quantum chemistry, covalent
interactions arise from orbital overlap. The interaction of two
orbitals depends on their energy eigenvalues. Consequently,
energies associated with the highest occupied molecular
orbital (EHOMO) and the lowest unoccupied molecular orbital
(ELUMO) are often good candidates for 2-dimensional
descriptors. For example, EHOMO might model the covalent
basicity of a hydrogen bond acceptor or the ELUMO might
model the covalent acidity of the proton of an H bond donor.
Further interpretation is possible because the HOMO energy
is related to the ionization potential and is a measure of the
molecule’s tendency to be attacked by electrophiles. Cor-
respondingly, the LUMO energy is related to the electron
affinity and is a measure of a molecule’s tendency to be
attacked by nucleophiles (Tuppurainen et al., 1991). Fur-
thermore, according to frontier molecular orbital theory,
transition state formation involves the interaction between
the frontier orbitals of reacting species. The HOMO–LUMO
gap, i.e., the difference between the EHOMO and the ELUMO, is
an important stability index (Lewis et al., 1994).
A large HOMO–LUMO gap implies high stability for
the molecule in the sense of its lower reactivity in chemical
reactions. The concept of chemical hardness has been
derived from the basis of the HOMO–LUMO energy gap
(Klopman and Iroff, 2004).
PCR
Eighteen quantum descriptors were calculated for each molecule
studied. All the descriptors representing the electrostatic
potential together with all frontier orbital descriptors used
in this work are listed in Table 2.
In order to obtain a linear relationship with the independent
variables, the logarithms of the inverse of the biological activity
(log 1/IC50) of the 35 molecules were used.
PCA is a multivariate technique that in QSAR analyzes a
data matrix in which molecules are described by several
inter-correlated quantitative-dependent descriptors. Its goal
is to extract the important information from the matrix, to
represent it as a set of new orthogonal variables called
principal components, and to display the pattern of
similarity of the observations and of the variables as points
in maps.
PCA was performed on the calculated quantum
descriptors. All the calculated PCs with their eigenvalues
are shown in Table 3. In this Table, the eigenvalues, the
percentage of variances explained by each eigenvalue, and
the cumulative percentage of variances are represented.
Therefore, we restricted the next studies to PCs and
selection of best subset of these PCs to perform linear and
non-linear regression methods.
Figure 2 shows how the 35 molecules are distributed in
the space of the first two principal components of the
quantum descriptor matrix. These two components retain
70 % of the variance. As can be seen in this figure, four
molecules (no. 3, 6, 20, and 35) differ markedly from the rest in
their quantum properties. In practice, real data often
contain some outliers, and these are usually not easy to
separate from the data set. Identifying outliers in the
original QSAR data set is important to ensure model quality.
These molecules (as potential outliers) were retained in the
dataset and were investigated further when determining the
applicability domain of the models developed in this study.
For the first model, PCR, a multivariate projection
method, was used to construct a relationship between the
quantum descriptors and the pIC50s of the compounds of interest.
In a typical PCR procedure, a PCA is followed by an MLR
step between the Y (pIC50s) matrix and the principal
components of the X quantum descriptors matrix.
Using the above procedure and factor scores as the
predictor parameters, the following equation was obtained:
$$\mathrm{pIC_{50}} = 6.414\,(\pm 0.094) - 1.596\,(\pm 0.368)\,\mathrm{PC9} + 0.382\,(\pm 0.106)\,\mathrm{PC6} + 0.949\,(\pm 0.445)\,\mathrm{PC10}$$
$$N = 28, \qquad R^2 = 0.604$$
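The PCR procedure described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the descriptors are autoscaled, the PCs come from an eigen-decomposition of the correlation matrix, and the data are synthetic. The names `pcr_fit`, `pcr_predict`, and `pc_indices` are hypothetical. Selecting PCs by index mirrors the fact that the reported equation uses a subset (PC6, PC9, PC10) rather than the top-ranked components.

```python
import numpy as np

def pcr_fit(X, y, pc_indices):
    """PCA on autoscaled X, then least squares of y on the selected PC scores."""
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std
    # Eigen-decomposition of the correlation matrix, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    loadings = eigvecs[:, order][:, pc_indices]
    scores = Z @ loadings
    A = np.column_stack([np.ones(len(y)), scores])  # intercept + selected PCs
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mean": mean, "std": std, "loadings": loadings, "coef": coef}

def pcr_predict(model, X):
    """Project new samples onto the stored loadings and apply the regression."""
    Z = (X - model["mean"]) / model["std"]
    scores = Z @ model["loadings"]
    return model["coef"][0] + scores @ model["coef"][1:]

# Synthetic stand-in data: 28 "molecules", 6 "descriptors".
rng = np.random.default_rng(0)
X = rng.normal(size=(28, 6))
y = 6.4 + X @ rng.normal(size=6) + rng.normal(scale=0.1, size=28)

model = pcr_fit(X, y, pc_indices=[0, 2, 4])  # an arbitrary subset of three PCs
pred = pcr_predict(model, X)
r2 = np.corrcoef(y, pred)[0, 1] ** 2
print(f"training R2 with three PCs: {r2:.3f}")
```

When every PC is retained, PCR reproduces ordinary least squares on the original descriptors; dropping PCs is what makes the method a regularized alternative.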
Table 3 The result of principal component analysis applied on the calculated quantum descriptors
| No. of PCs | Eigenvalue | Variance explained by each PC (%) | Cumulative variance explained (%) |
|---|---|---|---|
| 1 | 6.890 | 38.279 | 38.279 |
| 2 | 5.755 | 31.973 | 70.253 |
| 3 | 1.574 | 8.747 | 79.000 |
| 4 | 1.249 | 6.938 | 85.937 |
| 5 | 1.107 | 6.151 | 92.088 |
| 6 | 0.689 | 3.825 | 95.914 |
| 7 | 0.452 | 2.512 | 98.426 |
| 8 | 0.150 | 0.834 | 99.260 |
| 9 | 0.058 | 0.322 | 99.582 |
| 10 | 0.039 | 0.214 | 99.796 |
| 11 | 0.031 | 0.170 | 99.966 |
| 12 | 0.005 | 0.028 | 99.994 |
| 13 | 0.001 | 0.006 | 100.000 |
| 14 | 0.000 | 0.000 | 100.000 |
| 15 | 0.000 | 0.000 | 100.000 |
| 16 | 0.000 | 0.000 | 100.000 |
| 17 | 0.000 | 0.000 | 100.000 |
| 18 | 0.000 | 0.000 | 100.000 |

Fig. 2 Plot of PC1 versus PC2 of calculated quantum descriptors for training and test sets

To evaluate the predictive power of the generated
PCR model, the model was applied to predict the pIC50
values of all compounds in the calibration and prediction
sets. The pIC50 calculated by the model for each molecule
is given in Table 4.
Experimental versus predicted values for pIC50 values
of training and test set, obtained by the PCR modeling, are
shown graphically in Fig. 3a.
This model was validated using statistical parameters
such as PRESS and RMSE and also by the Tropsha criteria;
the results are reported in Table 5. On the basis of these
statistical parameters and the Tropsha criteria, the model
must clearly be rejected. In view of these results,
we decided to try a non-linear regression method, the radial
basis function artificial neural network, to obtain a robust and
predictive model able to describe the relationship between
the structure and the human glucagon receptor inhibitory
activity of the studied compounds.
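To make the PRESS and RMSE statistics concrete, the sketch below recomputes them for the PCR test set from the experimental and PCR-predicted values of the seven test molecules in Table 4. The section does not give the RMSE formula, so two common variants are shown; the choice of p = 3 (the number of PCs in the PCR equation) for the degrees-of-freedom variant is an assumption.

```python
import math

# (experimental, PCR-predicted) pIC50 of the seven test-set molecules in Table 4.
test_set = {
    11: (5.853, 6.698),
    13: (6.376, 6.482),
    15: (5.552, 5.963),
    22: (7.045, 6.230),
    25: (6.522, 6.555),
    26: (7.096, 6.808),
    30: (5.576, 6.383),
}

# PRESS: sum of squared differences between experimental and predicted values.
press = sum((obs - pred) ** 2 for obs, pred in test_set.values())

n = len(test_set)
p = 3  # assumed: number of PCs in the PCR equation
rmse_n = math.sqrt(press / n)              # divide by the number of molecules
rmse_dof = math.sqrt(press / (n - p - 1))  # divide by residual degrees of freedom

print(f"PRESS = {press:.3f}")
print(f"RMSE (n) = {rmse_n:.3f}, RMSE (n - p - 1) = {rmse_dof:.3f}")
```

The PRESS obtained this way agrees with the 2.294 reported for the PCR test set in Table 5. The degrees-of-freedom variant gives about 0.874, matching the tabulated RMSE, which suggests (but does not prove) that this is the convention used for the PCR columns.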
RBFNN
Another way to find a relationship between the biological
activity and the PCs is non-linear modeling, using the PCs as
inputs and an ANN as the regression tool.
In the model formation step, an RBFNN was built
to relate the PCs to the pIC50. This
model is referred to as PCA-RBFNN hereafter.
Table 4 The experimental activity of the compounds used in this study and their predicted values by PCR and ANN
| Molecule no. | Experimental activity | Predicted activity by PCR | Predicted activity by PCA-RBFNN |
|---|---|---|---|
| 1 | 7.431 | 7.330 | 7.431 |
| 2 | 6.931 | 7.164 | 6.931 |
| 3 | 8.275 | 7.620 | 8.275 |
| 4 | 6.173 | 6.363 | 6.173 |
| 5 | 7.356 | 6.625 | 7.356 |
| 6 | 5.910 | 6.385 | 5.910 |
| 7 | 6.795 | 6.682 | 6.795 |
| 8 | 6.494 | 6.412 | 6.494 |
| 9 | 6.744 | 6.111 | 6.744 |
| 10 | 6.096 | 6.651 | 6.096 |
| 11a | 5.853 | 6.698 | 5.855 |
| 12 | 6.376 | 6.054 | 6.376 |
| 13a | 6.376 | 6.482 | 6.376 |
| 14 | 6.468 | 6.723 | 6.468 |
| 15a | 5.552 | 5.963 | 5.552 |
| 16 | 5.772 | 6.250 | 5.772 |
| 17 | 5.866 | 6.184 | 5.866 |
| 18 | 7.301 | 6.537 | 7.301 |
| 19 | 6.585 | 6.490 | 6.585 |
| 20 | 7.301 | 6.527 | 7.301 |
| 21 | 6.721 | 6.510 | 6.721 |
| 22a | 7.045 | 6.230 | 7.045 |
| 23 | 6.677 | 6.818 | 6.677 |
| 24 | 6.795 | 6.588 | 6.795 |
| 25a | 6.522 | 6.555 | 6.522 |
| 26a | 7.096 | 6.808 | 7.096 |
| 27 | 6.000 | 7.030 | 6.000 |
| 28 | 5.939 | 6.415 | 5.939 |
| 29 | 6.376 | 6.232 | 6.376 |
| 30a | 5.576 | 6.383 | 5.576 |
| 31 | 4.847 | 4.803 | 5.506 |
| 32 | 5.546 | 5.130 | 5.546 |
| 33 | 5.324 | 5.881 | 5.433 |
| 34 | 5.424 | 6.014 | 5.424 |
| 35 | 5.841 | 5.835 | 5.373 |

a Molecules in test set assigned by Kennard and Stone algorithm

Fig. 3 Plot of predicted activity against the corresponding experimental activity for: a PCR and b PCA-RBFNN models
The inputs of the network were the PCs ranked by
eigenvalue. The number of PCs entered into the network was
varied from 1 to 18, and eight PCs were selected as the
network input. This number gave the best results
in terms of the lowest root mean square error for the
training set (RMSEC) and the lowest root mean square error
of cross validation (RMSECV) at the network output (Fig. 4).
For the PCA-RBFNN model, the "spread" and the
number of radial basis functions (the hidden layer units)
are the two important parameters influencing the
performance of the network. A robust model is
obtained by selecting the parameter values that give the
lowest error. A response surface approach was used to
optimize these parameters. The surface plot of RMSECV
as a function of the spread and the number of nodes in the
hidden layer is shown in Fig. 5. The results indicate that a
spread of 0.9 and 15 nodes in the hidden layer gave
the optimum PCA-RBFNN performance.
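The search over spread and hidden-layer size can be illustrated with a small, self-contained RBF network. This is a sketch under several assumptions that do not come from the paper: the basis functions are Gaussians of the form exp(-d^2 / (2 * spread^2)), the centers are drawn at random from the training points, the output weights are solved by linear least squares, and RMSECV is estimated by leave-one-out. The data are synthetic and all function names are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, spread):
    """Gaussian hidden-layer outputs for each sample, plus a bias column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.column_stack([np.exp(-d2 / (2.0 * spread ** 2)), np.ones(len(X))])

def rbf_fit(X, y, n_hidden, spread, rng):
    """Draw centers from the training points; solve output weights by least squares."""
    centers = X[rng.choice(len(X), size=n_hidden, replace=False)]
    weights, *_ = np.linalg.lstsq(rbf_design(X, centers, spread), y, rcond=None)
    return centers, weights

def rbf_predict(X, centers, weights, spread):
    return rbf_design(X, centers, spread) @ weights

def loo_rmse(X, y, n_hidden, spread, seed=0):
    """Leave-one-out RMSE (an RMSECV estimate) for one (n_hidden, spread) pair."""
    errors = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        rng = np.random.default_rng(seed)  # same center draw for every fold
        centers, w = rbf_fit(X[keep], y[keep], n_hidden, spread, rng)
        errors.append(y[i] - rbf_predict(X[i:i + 1], centers, w, spread)[0])
    return float(np.sqrt(np.mean(np.square(errors))))

# Synthetic stand-in for a 28 x 8 matrix of PC scores and a non-linear response.
rng = np.random.default_rng(1)
X = rng.normal(size=(28, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

# Evaluate RMSECV on a small grid, analogous to the surface in Fig. 5.
grid = {(h, s): loo_rmse(X, y, h, s) for h in (5, 10, 15) for s in (0.5, 0.9, 2.0, 4.0)}
best = min(grid, key=grid.get)
print("best (n_hidden, spread):", best, "RMSECV:", round(grid[best], 3))
```

A very small spread makes each unit respond only near its own center, so the fit to the training data can be excellent while cross-validated error stays high. Selecting on RMSECV rather than on training error guards against that.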
The non-linear regression method was tuned using the
training objects and evaluated using the test samples.
Fig. 4 Optimization of number of PCs used in radial basis function neural network

Fig. 5 Optimization of number of neurons in hidden layer and spread for RBF model

Table 5 Statistical parameters obtained for the PCR and PCA-RBFNN models
| Parameter | PCR, training set | PCR, test set | PCA-RBFNN, training set | PCA-RBFNN, test set |
|---|---|---|---|---|
| N | 28 | 7 | 28 | 7 |
| R2 | 0.604 | 0.161 | 0.956 | 0.999 |
| RMSE | 0.498 | 0.874 | 0.154 | 0.001 |
| PRESS | 5.956 | 2.294 | 0.665 | 5.39E-6 |
| R2LOO | 0.511 | | 0.987 | |
| RMSELOO | 0.526 | | 0.082 | |
| PRESSLOO | 7.483 | | 0.354 | |
| R2LMO | 0.487 | | 0.966 | |
| RMSELMO | 0.549 | | 0.099 | |
| PRESSLMO | 8.004 | | 0.377 | |
| (R2 - R0^2)/R2 | -0.655 | -2.996 | -0.045 | |
| (R2 - R'0^2)/R2 | -0.651 | -4.987 | -0.045 | |
| k | 1.000 | 0.975 | 0.998 | |
| k' | 0.995 | 1.017 | 1.001 | |
| Rm2 | 0.2242 | 0.0492 | 0.756 | |

N number of molecules in data set, R2 correlation coefficient of experimental and predicted activities, RMSE root mean square error, PRESS predicted error sum of squares, R2LOO correlation coefficient of leave-one-out cross validation, RMSELOO root mean square error of leave-one-out cross validation, PRESSLOO predictive residual sum of squares of leave-one-out cross validation, R2LMO correlation coefficient of leave-many-out cross validation, RMSELMO root mean square error of leave-many-out cross validation, PRESSLMO predictive residual sum of squares of leave-many-out cross validation
The predicted pIC50 values for the training and test data
are listed in Table 4. Figure 3b shows the plot of
observed versus predicted values for the training and test sets.
To compare the developed chemometric methods in
predicting the activity of the studied molecules, statistics
for both models, including R2 between the calculated and
experimental values for the training and test sets, are
given in Table 5. Inspection of the
RMSE and PRESS values for the developed regression
methods reveals the superiority of the PCA-RBFNN
method over PCR in predicting the inhibitory activity of the
human glucagon receptor antagonists studied in this work.
PCA-RBFNN also shows a good fit between predicted
and experimental values in the various sets. As can be
seen in Table 5, the model also successfully passed the
criteria for predictability recommended by Tropsha and Roy.
These results confirm that the developed PCA-RBFNN model
has the better statistical quality of the two.
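The Rm2 entries of Table 5 can be cross-checked against the other rows of the same table. The sketch below assumes the commonly quoted form of Roy's modified coefficient, Rm2 = R2 * (1 - sqrt(|R2 - R0^2|)); the formula is not given in this section, so its use here is an assumption. R0^2 is recovered from the tabulated ratio (R2 - R0^2)/R2.

```python
import math

def rm2(r2: float, r02: float) -> float:
    """Modified r^2 in the assumed form r2 * (1 - sqrt(|r2 - r0^2|))."""
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))

# (label, R2, (R2 - R0^2)/R2) taken from the PCR columns of Table 5.
rows = [("PCR training", 0.604, -0.655), ("PCR test", 0.161, -2.996)]

results = {}
for label, r2, ratio in rows:
    r02 = r2 * (1.0 - ratio)  # invert the tabulated ratio to get R0^2
    results[label] = rm2(r2, r02)
    print(f"{label}: Rm2 = {results[label]:.3f}")
```

Under this assumed formula the recomputed values round to 0.224 and 0.049, in line with the 0.2242 and 0.0492 listed for PCR in Table 5 and far below the 0.5 threshold usually applied to Rm2.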
A very important step in QSAR model development is
the definition of the applicability domain (AD) of the
models, because a developed model is
valid only within the domain for which it was
developed. We used the standardized residuals of the
activities calculated by the developed models, together with
leverage, to define the AD.
Leverage values can be calculated for both training
and test compounds. Calculating the leverage of training-set
compounds is useful for identifying compounds that cause
instability in the model. Calculating the
leverage of objects that were not used in model building
(such as the test set), on the other hand, is useful for
assessing whether they fall within the applicability
domain of the model.
The applicability domain of the PCA-RBFNN model is
shown in Fig. 6. Response outliers are compounds whose
standardized residuals exceed two standard
deviation units. Influential compounds are points with a
leverage value higher than the warning leverage limit. As
can be seen in this figure, all molecules in the training and
test sets lie in the applicability domain of the developed
model. Only two molecules (molecules 20 and 6) have a
leverage value higher than the warning leverage limit
(0.771), but their standardized residuals are within ±2.0 SD
units. Therefore, molecules 20 and 6 can be considered
influential in the fitting performance of the model, but there
is no credible reason to treat them as outliers and delete them
from the data set. On the other hand, as can be noted in
Fig. 6, two molecules, 31 and 35, have standardized
residuals beyond the cutoff level but show leverage
within the limit. As a result, none
of the studied molecules is both a response outlier and a
high-leverage compound.
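The two quantities plotted in a Williams plot can be computed as sketched below. This is an illustration on synthetic data, not a reconstruction of Fig. 6. The leverage is taken as h = x^T (X^T X)^-1 x with X the training matrix, the warning leverage uses the common rule 3(p + 1)/n, and the residuals are standardized by their sample standard deviation. All three choices are assumptions, since the section does not state the formulas used.

```python
import numpy as np

def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i, with X^T X built from the training matrix."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def standardized_residuals(y_obs, y_pred):
    """Residuals divided by their sample standard deviation."""
    res = y_obs - y_pred
    return res / res.std(ddof=1)

# Synthetic stand-ins: 8 PC scores for 28 training and 7 test molecules.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(28, 8))
X_test = rng.normal(size=(7, 8))
y_obs = rng.normal(loc=6.4, scale=0.7, size=28)
y_pred = y_obs + rng.normal(scale=0.15, size=28)

h_train = leverages(X_train, X_train)
h_test = leverages(X_train, X_test)
n, p = X_train.shape
h_star = 3 * (p + 1) / n  # assumed warning-leverage rule

z = standardized_residuals(y_obs, y_pred)
response_outliers = np.flatnonzero(np.abs(z) > 2.0)  # beyond +/- 2 SD units
high_leverage = np.flatnonzero(h_train > h_star)
both = np.intersect1d(response_outliers, high_leverage)

print("warning leverage:", round(h_star, 3))
print("response outliers:", response_outliers, "high leverage:", high_leverage)
print("flagged on both axes:", both)
```

Only compounds flagged on both axes are usually treated as candidates for removal, which is the reasoning applied to molecules 6, 20, 31, and 35 above.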
Fig. 6 Williams plot of developed PCA-RBFNN model

Conclusion
In this study, the ability of DFT calculations for the
development of quantitative structure–activity relationships
was assessed. B3LYP/6-311G was employed to
calculate the molecular geometries and quantum descriptors
of 35 2-pyridyl-3,5-diaryl pyrrole derivatives as human
glucagon receptor antagonists.
PCR and PCA-RBFNN were investigated as linear and
non-linear regression methods, respectively, for
building models. A comparison between the developed
methods revealed that PCA-RBFNN gave superior results:
for the training set it explained about 95 % of the variance
in the inhibitory activity data (R2 = 0.956) with a root mean
square error of 0.154. On the basis of the results
shown in Table 5, the non-linear method (PCA-RBFNN)
was considerably better than the PCR method in
goodness of fit, in predictivity, and in the other criteria
used to evaluate the proposed model. These
results suggest a non-linear relationship
between the principal components obtained from the
quantum descriptors and the glucagon receptor inhibitory
activity of the studied 2-pyridyl-3,5-diaryl pyrrole
derivatives. External validation showed the predictive
ability of the generated QSAR model. The predictive ability
of the obtained QSAR model was also assessed according to
the criteria of Tropsha et al. and Roy, and all the criteria
were passed. The applicability domain of the model was
defined by leverage values. None of the studied compounds
was outside the domain of the model.
Acknowledgments The authors gratefully acknowledge Vice
Chancellor for Research and Technology, Kermanshah University of
Medical Sciences for financial support. This article resulted from the
Pharm. D thesis of Zohreh Nazari, major of Pharmacy, Kermanshah
University of Medical Sciences, Kermanshah, Iran.
References
Agatonovic-Kustrin S, Beresford R (2000) Basic concepts of artificial
neural network (ANN) modeling and its application in pharma-
ceutical research. J Pharm Biomed Anal 22(5):717–727
Arkan E, Shahlaei M, Pourhossein A, Fakhri K, Fassihi A (2010)
Validated QSAR analysis of some diaryl substituted pyrazoles as
CCR2 inhibitors by various linear and nonlinear multivariate
chemometrics methods. Eur J Med Chem 45(8):3394–3406
Arulmozhiraja S, Morita M (2004) Structure-activity relationships for
the toxicity of polychlorinated dibenzofurans: approach through
density functional theory-based descriptors. Chem Res Toxicol
17(3):348–356
Becke AD (1993) Density-functional thermochemistry. III. The role
of exact exchange. J Chem Phys 98:5648
Breneman CM, Rhem M (1997) QSPR analysis of HPLC column
capacity factors for a set of high-energy materials using electronic
van der waals surface property descriptors computed by transfer-
able atom equivalent method. J Comput Chem 18(2):182–197
Cartier A, Rivail J-L (1987) Electronic descriptors in quantitative
structure–activity relationships. Chemometrics and Intelligent
Laboratory Systems 1(4):335–347
Clare BW (1995) Structure-activity correlations for psychotomimet-
ics. III: Tryptamines. Aus J Chem 48(8):1385–1400
Clare BW, Supuran CT (1994) Carbonic anhydrase activators. 3:
structure-activity correlations for a series of isozyme II activa-
tors. J Pharm Sci 83(6):768–773
Clare BW, Supuran CT (1998) Semi-empirical atomic charges and
dipole moments in hypervalent sulfonamide molecules: descriptors
in QSAR studies. J Mol Struct (Theochem) 428(1):109–121
Cronce DT, Famini G, De Soto J, Wilson L (1998) Using theoretical
descriptors in quantitative structure–property relationships: some
distribution equilibria. J Chem Soc Perkin Trans 2(6):1293–1302
Fassihi A, Shahlaei M, Moeinifard B, Sabet R (2012) QSAR study of
anthranilic acid sulfonamides as methionine aminopeptidase-2
inhibitors. Monatsh Chem 143(2):189–198
Frisch M, Trucks G, Schlegel Hea, Scuseria G, Robb M, Cheeseman
J, Montgomery J, Vreven T, Kudin K, Burant J (2008) Gaussian
03, revision C. 02
Karelson M, Lobanov VS, Katritzky AR (1996) Quantum-chemical
descriptors in QSAR/QSPR studies. Chem Rev 96(3):1027–1044
Kennard R, Stone L (1969) Computer aided design of experiments.
Technometrics 11(1):137–148
Klopman G, Iroff LD (2004) Calculation of partition coefficients by
the charge density method. J Comput Chem 2(2):157–160
Kohn W, Sham LJ (1965) Self-consistent equations including
exchange and correlation effects. Phys Rev 140:A1133–A1138
Lewis D, Lake B, Ioannides C, Parke D (1994) Inhibition of rat hepatic
aryl hydrocarbon hydroxylase activity by a series of 7-hydroxy
coumarins: QSAR studies. Xenobiotica 24(9):829–838
Mulliken R (1955a) Electronic population analysis on LCAO-MO
molecular wave functions. III. Effects of hybridization on overlap
and gross AO populations. J Chem Phys 23(12):2338–2342
Mulliken R (1955b) Electronic population analysis on LCAO-MO
molecular wave functions. IV. bonding and antibonding in
LCAO and valence-bond theories. J Chem Phys 23:2343
Mulliken R (1955c) Electronic population analysis on LCAO-MO
molecular wave functions. II. Overlap populations,
bond orders, and covalent bond energies. J Chem Phys 23:1841
Mulliken RS (1955d) Electronic Population Analysis on LCAO–MO
Molecular Wave Functions I. J Chem Phys 23(10):1833–1840
Ordorica M, Velazquez M, Ordorica J, Escobar J, Lehmann P (1993)
A principal component and cluster significance analysis of the
antiparasitic potency of praziquantel and some analogues. Quant
Struct Act Relat 12(3):246–250
Parr RG, Yang W (1989) Density-functional theory of atoms and
molecules, vol 16. Oxford University Press, Oxford
Pasha F, Srivastava H, Singh P (2005) Comparative QSAR study of
phenol derivatives with the help of density functional theory.
Bioorg Med Chem 13(24):6823–6829
Saghaie L, Shahlaei M, Madadkar-Sobhani A, Fassihi A (2010a)
Application of partial least squares and radial basis function
neural networks in multivariate imaging analysis-quantitative
structure activity relationship: study of cyclin dependent kinase 4
inhibitors. J Mol Graph Model 29(4):518–528
Saghaie L, Shahlaei M, Madadkar-Sobhani A, Fassihi A (2010b)
Application of partial least squares and radial basis function
neural networks in multivariate imaging analysis-quantitative
structure activity relationship: study of cyclin dependent kinase 4
inhibitors. J Mol Graph Model 29(4):518–528
Saghaie L, Shahlaei M, Fassihi A, Madadkar-Sobhani A, Gholivand
M, Pourhossein A (2011) QSAR analysis for some diaryl-
substituted pyrazoles as CCR2 inhibitors by GA-stepwise MLR.
Chem Biol Drug Des 77(1):75–85
Saghaie L, Sakhi H, Sabzyan H, Shahlaei M, Shamshirian D (2013)
Stepwise MLR and PCR QSAR study of the pharmaceutical
activities of antimalarial 3-hydroxypyridinone agents using
B3LYP/6-311++G** descriptors. Med Chem Res
22(4):1679–1688
Shahlaei M, Fassihi A (2013) QSAR analysis of some 1-(3,3-
diphenylpropyl)-piperidinyl amides and ureas as CCR5 inhibi-
tors using genetic algorithm-least square support vector machine.
Med Chem Res 22:4384–4400
Shahlaei M, Pourhossein A (2012) A 2D image-based method for
modeling some c-Src tyrosine kinase inhibitors. Med Chem Res
22:3012–3025
Shahlaei M, Pourhossein A (2013) Modeling of CCR5 antagonists as
anti HIV agents using combined genetic algorithm and adaptive
neuro-fuzzy inference system (GA–ANFIS). Med Chem Res
1–14
Shahlaei M, Fassihi A, Saghaie L (2010a) Application of PC-ANN
and PC-LS-SVM in QSAR of CCR1 antagonist compounds: a
comparative study. Eur J Med Chem 45(4):1572–1582
Shahlaei M, Sabet R, Ziari MB, Moeinifard B, Fassihi A, Karbakhsh
R (2010b) QSAR study of anthranilic acid sulfonamides as
inhibitors of methionine aminopeptidase-2 using LS-SVM and
GRNN based on principal components. Eur J Med Chem
45(10):4499–4508
Shahlaei M, Fassihi A, Saghaie L, Arkan E, Madadkar-Sobhani A,
Pourhossein A (2011a) Computational evaluation of some
indenopyrazole derivatives as anticancer compounds; application
of QSAR and docking methodologies. J Enzym Inhib Med Chem
28:16–32
Shahlaei M, Madadkar-Sobhani A, Fassihi A, Saghaie L, Shamshirian
D, Sakhi H (2011b) Comparative quantitative structure–activity
relationship study of some 1-aminocyclopentyl-3-carboxyamides
as CCR2 inhibitors using stepwise MLR, FA-MLR, and GA-
PLS. Med Chem Res 21:100–115
Shahlaei M, Madadkar-Sobhani A, Saghaie L, Fassihi A (2011c)
Application of an expert system based on Genetic Algorithm–
Adaptive Neuro-Fuzzy Inference System (GA–ANFIS) in QSAR
of cathepsin K inhibitors. Expert Sys Appl 39:6182–6191
Shahlaie M, Fassihi A, Pourhossein A, Arkan E (2013) Statistically
validated QSAR study of some antagonists of the human CCR5
receptor using least square support vector machine based on the
genetic algorithm and factor analysis. Med Chem Res 1–16
Sotomatsu T, Murata Y, Fujita T (1989) Correlation analysis of
substituent effects on the acidity of benzoic acids by the AM1
method. J Comput Chem 10(1):94–98
Tetko I, Luik A, Poda G (1993) Applications of neural networks in
structure-activity relationships of a small number of molecules.
J Med Chem 36(7):811–814
Trohalaki S, Gifford E, Pachter R (2000) Improved QSARs for
predictive toxicology of halogenated hydrocarbons. Comput
Chem 24(3):421–427
Tuppurainen K, Lotjonen S, Laatikainen R, Vartiainen T, Maran U,
Strandberg M, Tamm T (1991) About the mutagenicity of
chlorine-substituted furanones and halopropenals. A QSAR
study using molecular orbital indices. Mutat Res 247(1):97–102
Xiang Y, Liu M, Zhang X, Zhang R, Hu Z, Fan B, Doucet J, Panaye A
(2002) Quantitative prediction of liquid chromatography reten-
tion of N-benzylideneanilines based on quantum chemical
parameters and radial basis function neural network. J Chem
Inf Comput Sci 42(3):592–597
Yan X-F, Xiao H-M, Gong X-D, Ju X-H (2005) Quantitative
structure–activity relationships of nitroaromatics toxicity to the
algae (Scenedesmus obliquus). Chemosphere 59(4):467–471