ORIGINAL RESEARCH
Prediction of glucagon receptor antagonist activities of some substituted imidazoles using combined radial basis function neural network and density functional theory
Mohsen Shahlaei • Zohreh Nazari
Received: 29 July 2013 / Accepted: 22 October 2013
© Springer Science+Business Media New York 2013
Abstract A QSAR study of human glucagon receptor (HGR)
ligands was carried out using quantum descriptors, such as the
HOMO energy, the LUMO energy, softness, and hardness, in
combination with principal component analysis and a radial
basis function artificial neural network (ANN). The quantum
descriptors were calculated via the DFT-B3LYP method with
the 6-311G basis set. The developed neural network QSAR
model outperformed the principal component regression model
in both fitting and predictive abilities. ANN analysis indicated
that the estimated activities were in good agreement with the
experimentally observed values (R2 = 0.869, RMSD = 0.186;
predictive Q2 = 0.732, RMSEcv = 0.346). The developed
models were further examined by means of an external
prediction set. The modeling study also reflected the
important role of quantum properties of molecules when they
interact with the target, HGR. The developed neural network
model is expected to be useful in the rational design of new
chemical entities as ligands of HGR and also for directing the
synthesis of potent molecules in the future.
Keywords Glucagon receptor inhibition activity · QSAR · Radial basis function neural network · Density functional theory
Introduction
Glucagon is a 29-amino acid peptide hormone secreted by
the α-cells of the pancreas and is an important counter-regulatory
hormone in the control of glucose homeostasis
(Burcelin et al., 1996). Glucagon secreted from the pancreatic
α-cells helps maintain glucose homeostasis by
stimulating gluconeogenesis and glycogenolysis in hepatocytes
and lipolysis in adipocytes during the hypoglycemic
state (Johnson et al., 1972). Under normal conditions,
glucagon is synthesized and secreted in response to insufficient
blood glucose levels.
In healthy individuals, glucagon binds to specific
receptors in the liver called the human glucagon receptor
(HGR). Binding of glucagon to HGR triggers
the synthesis of glucose (gluconeogenesis) and the processing
and release of hepatic glycogen stores (glycogenolysis)
to restore blood glucose and maintain
homeostasis.
HGR is a G protein-coupled receptor; binding of glucagon
to it stimulates cyclic AMP and Ca2+ accumulation
as a result of adenylate cyclase activation (Trivedi
et al., 2000). In Type II diabetes, the bihormonal hypothesis
implicates inappropriate secretion and activity of the two
major pancreatic hormones that control glucose homeostasis,
insulin and glucagon (Unger and Orci, 1975).
In vivo research studies in various animal species imply
that neutralization of circulating glucagon alleviates
hyperglycemia (Brand et al., 1994). Hence, antagonists of
the HGR have the potential to modulate the rate of hepatic
glucose output and improve insulin responsiveness in the
liver, resulting in a decrease in fasting plasma glucose
levels in diabetics (Chang et al., 2001).
M. Shahlaei
Novel Drug Delivery Research Center, School of Pharmacy,
Kermanshah University of Medical Sciences, Kermanshah, Iran
Z. Nazari
Student Research Committee, Kermanshah University of
Medical Sciences, Kermanshah, Iran
M. Shahlaei (✉)
Department of Medicinal Chemistry, Faculty of Pharmacy,
Kermanshah University of Medical Sciences, Kermanshah, Iran
e-mail: [email protected]; [email protected]
Med Chem Res
DOI 10.1007/s00044-013-0869-9
Although several experimental procedures are available
for screening compounds for biological activity (e.g.,
in vivo and in vitro assays), all of them require receptors and
other biological materials of human, rat, mouse, or calf
origin (Hill, 1972). These procedures are expensive and
time-consuming, and the experimental techniques currently
employed can generate toxic by-products. Consequently,
the development of computational procedures
as alternative tools for predicting the properties of
compounds has been a subject of intensive study.
Among computational procedures for drug design and
discovery, quantitative structure–activity relationship
(QSAR) has found various applications for predicting
chemical properties, including biological activity (Seiers-
tad and Agrafiotis, 2006), physical properties (Verma et al.,
2005), and toxicity (Khadikar et al., 2002).
QSAR models are common and rather successful
tools in drug design and the computational discovery of
new lead compounds.
QSAR models are essentially regression models in
which the independent variables are molecular descriptors
that describe the structure of molecules quantitatively, and
the dependent variable is the activity of interest, usually
expressed as pIC50 (-log IC50) in the case of antagonists.
However, when the relationships are complex, conventional
QSAR models (such as multiple linear regression) often
yield insufficient or misleading information because of
nonlinear relationships within the studied dataset.
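As a small illustration of the dependent variable, the following Python sketch converts between IC50 and pIC50. It assumes IC50 is expressed in mol/L and that the logarithm is base 10; the text does not state the units, so both are assumptions, and the function names are hypothetical.

```python
import math


def pic50(ic50_molar: float) -> float:
    """Return pIC50 = -log10(IC50); IC50 is assumed to be in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def ic50_from_pic50(p: float) -> float:
    """Inverse transform: IC50 (mol/L) from pIC50."""
    return 10.0 ** (-p)


# Hypothetical example: an antagonist with an IC50 of 100 nM.
example = pic50(100e-9)
print(round(example, 3))
```

On this scale a more potent antagonist (smaller IC50) has a larger pIC50, which is why the regression target increases with activity.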
Moreover, for membrane-bound receptors (such as
HGR), the activity often results from both membrane
interaction and receptor binding, which may also lead to
nonlinear dependencies (Buyukbingol et al., 2007). One
way to overcome the difficulties of such nonlinearities
in QSAR studies is to use artificial
neural network (ANN) models, which have gained much
interest in the field of computational drug design (Valkova
et al., 2004; Arkan et al., 2010; Saghaie et al., 2010;
Shahlaei et al., 2010a, b, 2012a, 2013; Shahlaei and Fassihi,
2012; Shahlaei, 2013). After a suitable learning step,
an ANN model should be able to “recognize” basic correlations
in a given dataset and to predict, for example,
pIC50 (Saghaie and Fassihi, 2012; Fassihi et al., 2012;
Shahlaei et al., 2012b). In this study, a principal component
analysis-based neural network approach is
applied to build a QSAR model on a set of 35 substituted
imidazoles with known affinity for HGR. The goals of the
present study are to extract the relevant principal components
from quantum descriptors, to establish the QSAR of
the library of compounds, to establish the predictive
ability of radial basis function neural network (RBFNN)
modeling on this library of ligands, and to develop
insights regarding the relationship between the principal
components extracted from the quantum descriptors of the
compounds of interest and their affinity for HGR. The
generated nonlinear ANN model is expected to be valuable
in the rational design of chemical modifications of HGR
antagonists and in identifying the most likely candidates for the
synthesis and discovery of new lead ligands.
Methods
Calculation of quantum descriptors
The biological data used in this study were glucagon receptor
inhibitory activities (in terms of -log IC50) of a set of 35
compounds taken from the study by
Chang et al. (2001). The basic skeletons of the studied
compounds and their details are summarized in Table 1. The
initial structures of all substituted imidazole derivatives
were constructed using the software CS Chem3D (Ultra 10.0,
ChemOffice 2006, CambridgeSoft Corporation).
To save computational time, initial geometry optimizations
were carried out by the molecular mechanics (MM)
method using the MM+ force field. The resulting
geometries of all 35 studied ligands were reoptimized
with the density functional theory (DFT) method at the
B3LYP/6-31G level (Becke, 1993), and frequency calculations
were performed at the same level for all of the
geometries to make sure that they are minima on
the potential energy surface. DFT is a quantum mechanical
modeling technique based on the Hohenberg–Kohn theorem
(Hohenberg and Kohn, 1964) and the Kohn–Sham
method (Kohn and Sham, 1965) and is used to calculate the
ground-state electronic energy E0 and other ground-state
molecular properties from the ground-state electronic
density ρ0 instead of the electronic wave function. Since
the mid-1990s, the B3LYP level of theory (Lee et al., 1988;
Becke, 1993) has been the most extensively applied for
quantum calculations on molecules because of the accuracy
of the results obtained for a vast range of molecules,
particularly organic molecules. All B3LYP
calculations were carried out with the Gaussian 03 program
(Frisch et al., 2008).
As listed in Table 2, 18 quantum-chemical descriptors
were used to analyze their variation and their efficiency in
predicting the inhibition activity of the compounds of interest.
The quantum descriptors employed in this study, such as
polarizability (α), dipole moment (μ), energy of the highest
occupied molecular orbital (EHOMO), energy of the lowest
unoccupied molecular orbital (ELUMO), the most negative
atomic charge, the most positive charge, etc., have all been
obtained directly or indirectly (Table 2) from the Gaussian
output files.
Table 1 The main skeletons and details of structures used in the current study
[Scaffold for compounds 1-32: structure drawing with labels N, NH, R1, R2, R3]
Compd   Atom labels in the R3, R2, R1 drawings   pIC50
1       N, F, Br                                 6.7958
2       N, F                                     6.886
3       N, F, Br                                 7.045
4       N, F, Br                                 6.823
5       N, F, Cl                                 7.096
6       N, F, F                                  7.000
7       N, F, I                                  7.000
8       N, F, CH3                                7.045
9       N, F, CH3, CH3                           6.552
10      N, F                                     6.522
11      N, F, NH2                                7.154
12      N, F, OH                                 7.096
13      N, F, OCH3                               7.000
14      N, F, CN                                 6.677
15      N, F, CO2CH3                             6.522
16      N, F, S, Br                              6.958
17      N, F, O, Br                              6.721
18      N, F                                     6.468
19      N, F, H2CH2C                             6.920
20      N, Br                                    7.397
21      Br                                       5.657577
22      OH, Br                                   7.000
23      N, H3C, F, Cl                            7.698
24      N, CH3, F, Cl                            7.301
25      N, Cl, Cl                                7.638
26      N, I, Cl                                 6.853
27      N, Cl                                    5.481
28      N, Cl                                    6.229
29      N, O, Cl                                 6.823
30      N, O, Cl                                 6.657
31      N, O, Cl                                 6.602
32      N, OCH2CH2CH3, OCH2CH2CH3, Cl            5.619
[Scaffold for compounds 33-35: structure drawing with labels N, SCH3, F, Het]
Compd   Atom labels in the Het drawing           pIC50
33      N, N, Me                                 6.283
34      N, NH, Me                                6.065
35      N, O                                     6.031
The selection of input variables for an ANN is essential to
avoid “overfitting” (Tetko et al., 1993) when
many input descriptors are available. As a linear technique for
dimensionality reduction, PCA transforms the input
dataset from its original form (points in m-dimensional
space) to a new form (points in p-dimensional space),
where p is less than m. During the process, most of the
variability of the original input dataset is
retained. With the input dataset represented in a lower
dimension, a smaller ANN can be used for
prediction.
Principal component analysis (PCA)
Next, a PCA was performed for variable reduction and data
interpretation. In PCA, descriptors that describe the same
property cluster together, and hence the predicted activity
can be described with a smaller number of independent
variables.
In the PCA, the calculated descriptors must first be
preprocessed by mean centering and autoscaling. If k denotes
the number of molecules used in the regression and p the
number of descriptors calculated for each molecule, then the
activity (pIC50) is a vector y of length k, and X
is a (k × p) matrix containing the calculated descriptors as
row vectors. X is decomposed as
\[ X = CR \tag{1} \]
The matrix C and the eigenvalues [λ] are given by
solving the eigenvalue problem:
\[ C^{T} Z C = [\lambda] \tag{2} \]
and the eigenvector matrix R (k × p) is calculated by
\[ R = C^{T} X \tag{3} \]
In Eq. (2), Z = XX^T denotes the (k × k) covariance
matrix, and [λ] is the diagonal matrix of eigenvalues. The
rows of C^T are the eigenvectors of Z, and its columns are
the “scores.” The column vectors of the square matrix
C are orthonormal and are often called principal components
(PCs).
After generation of the principal components, their scores
were used as new variables for regression.
Generally speaking, only p principal components are
needed to account for most of the variance in an m-dimensional
dataset, where p is the number of important principal
components and m is the total number of
principal components in the dataset. Since
p is less than m, PCA is generally regarded as a data
reduction technique: a multidimensional
dataset can be projected onto a lower-dimensional space
without losing most of the information in the original
dataset.
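The preprocessing and projection steps above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: it diagonalizes the (p × p) descriptor correlation matrix rather than the (k × k) matrix Z used in the text (the two share the same nonzero eigenvalues up to a constant factor), and the random matrix merely stands in for a 35 × 18 descriptor table. The function name `pca_scores` is hypothetical.

```python
import numpy as np


def pca_scores(X, n_components):
    """Autoscale X (molecules x descriptors); return (scores, eigenvalues)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0.0] = 1.0                      # guard against constant descriptors
    Z = (X - mean) / std                       # mean centering + autoscaling
    corr = (Z.T @ Z) / (Z.shape[0] - 1)        # descriptor correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-rank by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs[:, :n_components]     # coordinates on the retained PCs
    return scores, eigvals


# Hypothetical stand-in data: 35 molecules, 18 descriptors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(35, 18))
scores, eigvals = pca_scores(X_demo, n_components=9)
explained_pct = 100.0 * eigvals / eigvals.sum()
```

With autoscaled data the eigenvalues sum to the number of descriptors, so each eigenvalue divided by that number gives the fraction of variance explained by the corresponding PC.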
Splitting PCs matrix into training and test sets
In the next step, to develop a reliable (validated) QSAR
model, the molecules are selected consecutively and assigned
to either the training or the test set. The division of an
original dataset into training and test sets can be carried
out using various
algorithms.
Ideally, this division must be carried out so that points
representing both training and test sets are distributed
within the whole descriptor space defined by the original
dataset, and each molecule of the test set is close to at least
one molecule of the training set. This method ensures that
the similarity principle can be used for the pIC50 prediction
of the test set. The Kennard–Stone (KS) algorithm (Ken-
nard and Stone, 1969) is well known in the field of data
splitting and has already found many uses in QSAR studies
(Arkan et al., 2010; Saghaie et al., 2010; Shahlaei et al.,
2010a, b).
Random sampling is a well-known method because of
its simplicity and because a group of data randomly
extracted from a larger set follows the statistical distribution
of the entire set. However, random sampling
neither guarantees the representativeness of the set nor
prevents extrapolation problems (Rajer-Kanduc et al.,
2003). In other words, random selection does not ensure
that the molecules on the boundaries of the set are included
in the training set. A frequently used alternative to random
selection is the KS algorithm, which aims to span the
multidimensional space in a
uniform manner by maximizing the Euclidean distances
between the descriptor vectors (x) of the selected molecules
(Wu et al., 1996).
Table 2 Quantum electronic descriptors used in this study
Descriptor abbreviation   Descriptor definition
EHOMO                     The energy of the highest occupied molecular orbital
ELUMO                     The energy of the lowest unoccupied molecular orbital
H–L                       The HOMO–LUMO energy gap
Electronegativity         $\chi = -(E_{\mathrm{HOMO}} + E_{\mathrm{LUMO}})/2$
Hardness                  $\eta = (E_{\mathrm{LUMO}} - E_{\mathrm{HOMO}})/2$
Electrophilicity          $\omega = \chi^{2}/(2\eta)$
MPC                       The most positive charge
LNC                       The least negative charge
SSC                       The sum of square of charges
SSPC                      The sum of square of positive charges
SSNC                      The sum of square of negative charges
SPC                       The sum of positive charges
SNC                       The sum of negative charges
SAC                       The sum of absolute charge
DMx                       Dipole moment in x direction
DMy                       Dipole moment in y direction
DMz                       Dipole moment in z direction
TDM                       Total dipole moment
In order to ensure a uniform distribution of such a
subset along the x (descriptors) data space, KS follows a
stepwise method in which new selections are taken in
regions of the space far from the molecules already
selected. Therefore, the method uses the Euclidean dis-
tances dx(p,q) between the x-vectors of each pair (p,q) of
molecules computed as
\[ d_x(p,q) = \sqrt{\sum_{j=1}^{J} \left[ x_p(j) - x_q(j) \right]^{2}}, \qquad p, q \in [1, N] \]
For a typical QSAR, xp(j) and xq(j) are the descriptor
values at the jth descriptor for molecules p and q,
respectively. J denotes the number of descriptors in the
original dataset matrix. The selection procedure starts by
taking the pair (p1,p2) of molecules for which the Euclidean
distances dx(p1,p2) are the largest. In each subsequent
iteration, the method selects the molecule that has the
largest minimum Euclidean distance with respect to any
molecule already selected. Such a procedure is repeated until
the number of molecules specified by the user is achieved.
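The stepwise selection described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, ties are broken by taking the first candidate encountered, and the full distance matrix is precomputed, which is adequate for a dataset of a few dozen molecules.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kennard_stone(X, n_select):
    """Return (selected, remaining) row indices chosen by Kennard-Stone."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of rows")
    dist = [[euclidean(X[p], X[q]) for q in range(n)] for p in range(n)]
    # Step 1: start with the two most distant molecules.
    pairs = ((p, q) for p in range(n) for q in range(p + 1, n))
    p1, p2 = max(pairs, key=lambda pq: dist[pq[0]][pq[1]])
    selected = [p1, p2]
    remaining = [i for i in range(n) if i not in selected]
    # Step 2: repeatedly add the molecule whose nearest selected
    # neighbour is farthest away (largest minimum distance).
    while len(selected) < n_select:
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining


# Hypothetical one-descriptor example with five "molecules".
X_demo = [[0.0], [1.0], [2.0], [10.0], [5.5]]
train_idx, test_idx = kennard_stone(X_demo, n_select=3)
```

In this toy example the two extremes (rows 0 and 3) are picked first, and the third pick is the row lying farthest from both of them; the unselected rows would form the test set.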
Radial basis function neural networks (RBFNNs)
In the present study, one type of neural networks, namely
RBFNN, was employed to establish an alternative nonlin-
ear model. The theory of RBFNN has been adequately
described in detail elsewhere (Xiang et al., 2002). Hence,
we will limit ourselves to a brief outline highlighting only
the most important aspects.
Usually, an RBFNN comprises three layers: the input layer,
the hidden layer, and the output layer (Fig. 1). The input
layer does not process the information; it only distributes
the input vectors to the hidden layer, which
consists of a number of RBF units (nh) and biases
(bk). Each neuron in the hidden layer employs a radial
basis function (RBF) as a nonlinear transformation function
to operate on the input data. The most frequently used RBF
is a Gaussian function, which is characterized by a center and
a width. This function measures the Euclidean distance
between the input vector (X) and the center (cj) and performs
the nonlinear transformation within the hidden layer
as follows:
\[ H_j = \exp\left( -\frac{\lVert X - c_j \rVert^{2}}{\sigma_j^{2}} \right) \tag{4} \]
where $H_j$ denotes the output of the jth RBF unit, while $c_j$ and $\sigma_j$ are
the center and the width of that unit, respectively. The
operation of the output layer is linear and is given by:
\[ y_k(X) = \sum_{j=1}^{n_h} W_{kj} H_j + b_k \tag{5} \]
where $y_k$ is the kth output unit for the input vector X, $W_{kj}$ is
the weight connection between the kth output unit and the
jth hidden layer unit, and $b_k$ is the respective bias (Fig. 1).
From Eqs. (4) and (5), one can see that the design of
RBFNN involves selecting centers, number of hidden layer
units, widths, and weights. There are various methods for
selecting the centers, such as random subset selection,
k-means clustering, and RBF–PLS. In this study, a forward
subset selection routine was used to select the centers from
the training set samples. As regards the widths of the radial
basis functions, they can either be chosen equal for all the
units or different for each unit. Here, we limited ourselves to
Gaussian functions with a constant width for all of the units.
Furthermore, the adjustment of the connection weight
between the hidden layer and output layer was performed
using a least-squares solution after the selection of the RBF
centers and width. All RBFNN calculations were performed
using home-developed scripts using the MATLAB package
(www.mathworks.com/products/matlab/).
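The authors' MATLAB scripts are not reproduced in the text, so the following Python/NumPy sketch only illustrates the general structure described above: Gaussian hidden units with one constant width, followed by a least-squares solution for the output weights and bias. The centers are simply passed in here; the forward subset selection used in the study is not implemented, and all function names and the toy data are hypothetical.

```python
import numpy as np


def rbf_hidden(X, centers, width):
    """Hidden-layer outputs H[i, j] = exp(-||x_i - c_j||^2 / width^2)."""
    diff = X[:, None, :] - centers[None, :, :]
    return np.exp(-(diff ** 2).sum(axis=2) / width ** 2)


def fit_output_layer(H, y):
    """Least-squares output weights and bias for a single linear output unit."""
    A = np.hstack([H, np.ones((H.shape[0], 1))])   # last column models the bias
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]


def predict(X, centers, width, weights, bias):
    """Network output: linear combination of the hidden-layer outputs."""
    return rbf_hidden(X, centers, width) @ weights + bias


# Hypothetical toy problem: fit a smooth one-dimensional function.
X_train = np.linspace(0.0, 6.0, 25).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
centers = X_train[::3]          # every third training sample used as a center
width = 1.0                     # one constant width for all units
weights, bias = fit_output_layer(rbf_hidden(X_train, centers, width), y_train)
y_fit = predict(X_train, centers, width, weights, bias)
train_rmse = float(np.sqrt(np.mean((y_fit - y_train) ** 2)))
```

Because the output layer is linear, only the centers and the width require a search; once they are fixed, the weights follow from a single least-squares solve.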
Moreover, the overall performance of the final RBFNN
model was evaluated in terms of its root mean squared
error (RMSE), and its goodness of fit and robustness were
estimated by the same statistical parameters as those used
for the linear model (Fig. 2).
Fig. 1 A typical radial basis function network architecture with
n and j neurons in the input (X) and hidden (H) layers, respectively,
and a single neuron in the output layer (Y). An activation function
(drawn inside the circles) depicts what happens at a given neuron.
The Wi represent the weights
Validation and evaluation
Testing the stability, predictive power, and generalization
ability of the models is a very important step in a QSAR
study. Several tools are available to estimate the
accuracy and validity of a proposed QSAR
model, as well as the impact of the preprocessing steps; they
can be categorized into two sets: internal validation and
external validation.
Some of the common parameters used for checking the
predictability of proposed models are root mean square
error (RMSE), square of the correlation coefficient (R2),
and a predictive residual error sum of squares (PRESS).
These parameters were calculated for each model as
follows:
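\[ \mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^{2} \right]^{1/2} \tag{6} \]
\[ R^{2} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}} \tag{7} \]
\[ \mathrm{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2} \tag{8} \]
where $y_i$ is the true bioactivity of the investigated compound
i, $\hat{y}_i$ represents the calculated bioactivity of
compound i, $\bar{y}$ is the mean of the true activity in the studied set,
and n is the total number of molecules used in the studied
sets.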
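A short Python sketch of these statistics is given below. It reads R2 as the explained sum of squares divided by the total sum of squares, which is one common interpretation of Eq. (7) and is an assumption here; the function names are hypothetical. The five value pairs are the experimental and predicted activities of molecules 1, 17, 19, 25, and 27 from Table 4.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and calculated activities."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def press(y_true, y_pred):
    """Predictive residual error sum of squares."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Explained sum of squares over total sum of squares (assumed form)."""
    mean = sum(y_true) / len(y_true)
    ss_explained = sum((p - mean) ** 2 for p in y_pred)
    ss_total = sum((t - mean) ** 2 for t in y_true)
    return ss_explained / ss_total


# Molecules 1, 17, 19, 25, and 27 of Table 4 (Y and Ypred columns).
y_obs = [6.80, 6.72, 6.92, 7.64, 5.48]
y_calc = [6.91, 7.10, 6.65, 7.12, 6.17]
demo_rmse = rmse(y_obs, y_calc)
demo_press = press(y_obs, y_calc)
demo_r2 = r_squared(y_obs, y_calc)
```

RMSE and PRESS are tied by RMSE² × n = PRESS, so reporting both mainly serves readability; R2 is the only one of the three that is scaled by the spread of the observed activities.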
Cross-validation is one of the most popular methods
for internal validation. In this study, the internal predictive
capability of the model was evaluated by leave-one-out
cross-validation (Q2LOO). A good Q2LOO often indicates
good robustness and high internal predictive power of a
QSAR model. However, Tropsha et al.
(2003) indicate that there is no evident correlation between
the value of Q2LOO and the actual predictive power of a
QSAR model, revealing that Q2LOO alone is inadequate
as a reliable estimate of a model's predictive ability for all
new compounds. To determine both the generalizability
of QSAR models to new compounds and their true
predictive ability, statistical external
validation can be applied at the model development step by
properly employing a prediction set. The
results of data splitting using the KS algorithm are shown in
Table 1, where the test set is indicated with an asterisk.
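Leave-one-out cross-validation can be sketched generically in Python as below. The sketch assumes the usual definition Q2 = 1 − PRESS/SSD, with SSD the sum of squared deviations of the observed values from their mean. The `fit` and `predict` callables are hypothetical placeholders for any model, and a one-variable straight line is used only to keep the example self-contained.

```python
def loo_q2(X, y, fit, predict):
    """Leave-one-out Q2 = 1 - PRESS / SSD for any fit/predict pair."""
    n = len(y)
    press = 0.0
    for i in range(n):
        X_train = X[:i] + X[i + 1:]        # drop molecule i
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)      # refit without molecule i
        press += (y[i] - predict(model, X[i])) ** 2
    mean_y = sum(y) / n
    ssd = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / ssd


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (v - my) for x, v in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


# Hypothetical, exactly linear data: every left-out point is predicted exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
q2_line = loo_q2(xs, ys, fit_line, predict_line)
```

A model that merely predicts the training-set mean gives a negative Q2 under this definition, which makes Q2 a stricter criterion than the fitted R2.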
Results and discussion
Many molecular properties depend on intermolecular
interactions. The main component of these interactions is
electrostatic in its nature. Electrical charges in the molecule
are simply the driving force of electrostatic interactions.
Charge-based descriptors have therefore widely been used
as chemical reactivity indices or as measures of weak
intermolecular interactions. The charge distributions in a
given molecule and partial charges on the atoms can be
estimated using quantum-chemical calculations. One of the
most important parts of the standard output of almost any
quantum calculation is the Mulliken atomic charges
(Mulliken, 1955a, b, c, d). Usually, the minimum (most
negative) and maximum (most positive) atomic partial
charges in the molecule, or the minimum or maximum
partial charges for particular types of atoms, are employed
as quantum descriptors (Clare and Supuran, 1994; Cartier
and Rivail, 1987). Different sums of absolute or square
values of partial charges (e.g., sum of positive charges and
sum of square of positive charges) have also been
employed to explain intermolecular interactions. Other
usual charge-based quantum-chemical indices used as
electrostatic descriptors in QSAR models are the average
absolute atomic charge (Clare and Supuran, 1994; Ordorica
et al., 1993) and a polarity parameter defined as the
difference between the values of the most positive and most
negative charges (Clare and Supuran, 1994; Cartier and Rivail,
1987; Clare, 1995).
Fig. 2 Score plots on the first three principal components of PCA
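The charge-based indices discussed above are simple functions of the list of atomic partial charges. The Python sketch below computes several of them; the function name and the dictionary keys are hypothetical, and the five charges are made-up values chosen only so that the arithmetic is easy to follow. They are not taken from any of the studied molecules.

```python
def charge_descriptors(charges):
    """Charge-based indices from a list of atomic partial charges."""
    positive = [q for q in charges if q > 0]
    negative = [q for q in charges if q < 0]
    return {
        "most_positive": max(charges),
        "most_negative": min(charges),
        "sum_sq": sum(q * q for q in charges),
        "sum_sq_positive": sum(q * q for q in positive),
        "sum_sq_negative": sum(q * q for q in negative),
        "sum_positive": sum(positive),
        "sum_negative": sum(negative),
        "sum_absolute": sum(abs(q) for q in charges),
        "mean_absolute": sum(abs(q) for q in charges) / len(charges),
        # Polarity parameter: most positive minus most negative charge.
        "polarity": max(charges) - min(charges),
    }


# Hypothetical partial charges for a five-atom fragment.
demo_charges = [-0.5, 0.25, 0.25, -0.25, 0.25]
demo = charge_descriptors(demo_charges)
```

For a neutral molecule the sums of positive and negative charges cancel, so the squared and absolute sums carry the information about how strongly the charge is separated.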
Electrostatic interactions can also be described by the
respective electrical moments and their components. The
polarity is described by the dipole moment (μ). The polarization
of a molecule by an external electric field can be defined
in terms of nth-order susceptibility tensors of the molecule
(Sotomatsu et al., 1989). The first-order term, referred
to as the polarizability of the molecule, represents the relative
susceptibility of the electron cloud of an atom or a molecule
to be distorted from its normal shape by the presence of an
external field. Due to this distortion, an induced electric
dipole moment appears. Polarizability (α) is a tensor relating
the induced dipole moment (μind) to the applied electric
field strength. The nondiagonal elements of the tensor represent
the polarizability of the electrons along one of the axes
of the coordinate system due to a component of the applied
electric field along the other coordinate axes. As this effect is
insignificant compared to the polarizability in the direction
of the applied electric field, the nondiagonal elements of the
polarizability tensor are zero or very small compared with
the diagonal elements. The polarizability is therefore represented
in practice as the “mean polarizability,” i.e., the average
polarizability over the three axes of the molecule, which
equals one-third of the trace of the tensor. It has been shown that α is
related to the molecular volume (Lewis et al., 1994),
hydrophobicity (Breneman and Rhem, 1997), and the
electrophilic superdelocalizability (Clare and Supuran, 1998).
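The relation between the polarizability tensor, the induced dipole moment, and the mean polarizability can be illustrated with a few lines of Python. The tensor values and function names below are hypothetical and are given in arbitrary units.

```python
def mean_polarizability(alpha):
    """One-third of the trace of a 3 x 3 polarizability tensor."""
    return (alpha[0][0] + alpha[1][1] + alpha[2][2]) / 3.0


def induced_dipole(alpha, field):
    """Induced dipole moment: tensor alpha applied to the field vector."""
    return [sum(alpha[i][j] * field[j] for j in range(3)) for i in range(3)]


# Hypothetical tensor with small off-diagonal elements (arbitrary units).
alpha_demo = [
    [9.0, 0.5, 0.0],
    [0.5, 12.0, 0.0],
    [0.0, 0.0, 6.0],
]
alpha_mean = mean_polarizability(alpha_demo)
mu_x_field = induced_dipole(alpha_demo, [1.0, 0.0, 0.0])
```

A field along x induces a dipole mainly along x, with only a small y component from the off-diagonal element, which is why the scalar mean is usually sufficient as a descriptor.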
According to classical chemical theory, all chemical
interactions are by nature either electrostatic (polar) or
orbital (covalent) driven. In quantum chemistry, covalent
interactions arise from orbital overlap. The interaction of
two orbitals depends on their energy eigenvalues. Conse-
quently, energies associated with the highest occupied
molecular orbital (EHOMO) and the lowest unoccupied
molecular orbital (ELUMO) are often good candidates for
2D descriptors. For example, EHOMO might model the
covalent basicity of a hydrogen bond acceptor or the
ELUMO might model the covalent acidity of the proton of a
H bond donor. Further interpretation is possible because
the HOMO energy is related to the ionization potential and
is a measure of the molecule’s tendency to be attacked by
electrophiles. Correspondingly, the LUMO energy is rela-
ted to the electron affinity and is a measure of a molecule’s
tendency to be attacked by nucleophiles (Tuppurainen
et al., 1991). Furthermore, according to frontier molecular
orbital theory, transition state formation involves the
interaction between the frontier orbitals of reacting species.
The HOMO–LUMO gap, i.e., the difference between the
EHOMO and the ELUMO is an important stability index
(Lewis et al., 1994).
A large HOMO–LUMO gap implies high stability for
the molecule in terms of its lower reactivity in chemical
reactions. The concept of chemical hardness has been
derived from the basis of the HOMO–LUMO energy gap
(Klopman and Iroff, 2004).
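As an illustration of how the frontier-orbital descriptors follow from the two orbital energies, the Python sketch below uses the widely used finite-difference (Koopmans-type) definitions: electronegativity χ = −(EHOMO + ELUMO)/2, hardness η = (ELUMO − EHOMO)/2, and electrophilicity ω = χ²/(2η). These conventions are an assumption of this sketch; where the definitions in Table 2 differ, Table 2 takes precedence. The function name and the orbital energies are hypothetical.

```python
def frontier_descriptors(e_homo, e_lumo):
    """Gap, electronegativity, hardness, and electrophilicity index."""
    gap = e_lumo - e_homo                 # HOMO-LUMO energy gap
    chi = -(e_homo + e_lumo) / 2.0        # electronegativity (assumed convention)
    eta = gap / 2.0                       # hardness: half of the gap
    omega = chi ** 2 / (2.0 * eta)        # electrophilicity index
    return {"gap": gap, "chi": chi, "eta": eta, "omega": omega}


# Hypothetical orbital energies in eV.
fmo = frontier_descriptors(e_homo=-6.0, e_lumo=-2.0)
```

With these conventions a wider gap gives a harder, less reactive molecule, in line with the stability argument made above.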
Eighteen quantum descriptors were calculated for each
of the studied molecules. All the descriptors representing
the electrostatic potential, together with all frontier orbital
descriptors used in this study are listed in Table 2.
In order to get the linear relationship with independent
variables, logarithms of the inverse of biological activity
(Log 1/IC50) data of 35 molecules were used.
PCA is a multivariate technique in QSAR that analyzes a
data matrix in which molecules are described by several
intercorrelated quantitative-dependent descriptors. Its goal
is to extract the important information from the matrix to
represent it as a set of new orthogonal variables called
principal components, and to display the pattern of similarity
of the observations and of the variables as points in maps.
PCA was performed on the calculated quantum
descriptors. All the calculated PCs with their eigenvalues
are shown in Table 3, which lists the eigenvalues, the
percentage variance explained by each eigenvalue, and the
cumulative percentage variance. We therefore
limited the further studies to the PCs and to the selection of
the best subset of these PCs for the linear and nonlinear
regression methods.
Table 3 Percentage variance from the PCA carried out on the
original matrix of the quantum descriptors
PC no.   Eigenvalue   Variance (%)   Cumulative variance (%)
1        8.081        44.893         44.893
2        4.409        24.494         69.386
3        1.567        8.706          78.092
4        1.394        7.745          85.838
5        0.720        4.001          89.839
6        0.632        3.509          93.348
7        0.521        2.893          96.241
8        0.430        2.390          98.631
9        0.195        1.086          99.717
10       0.034        0.188          99.905
11       0.012        0.066          99.971
12       0.003        0.016          99.987
13       0.001        0.007          99.994
14       0.001        0.005          99.998
15       0.000        0.002          100.000
16       0.000        0.000          100.000
17       0.000        0.000          100.000
18       0.000        0.000          100.000
PCA, despite its outstanding properties, is known to have
some inadequacies. One of them is that it is
strongly influenced by the presence of outliers. In QSAR
studies, outliers are molecules exhibiting very different
values for some of the calculated descriptors in
comparison with the majority of molecules. When outliers
are present, the obtained PCs will not explain the majority
of the data well, and one cannot get a suitable insight into
the data structure. One way to deal with this problem is to
remove the outlying molecules detected on the score plots
and to repeat the PCA procedure. The first three PCs
accounted for 78.1 % of the total variance in PCA (Table 3),
and the scores of the 35 molecules on these PCs are
shown in Fig. 2, on which any obvious outliers would be apparent.
Table 3 shows the eigenvalues and percentage variance
for the principal components extracted from the original
quantum data matrix. As seen in this table, the largest part
of the information is captured by the first principal
component (44.89 %). The second and third principal
components carry less information
(24.49 and 8.71 %, respectively), and all components
together account for 100 % of the information content. The
minimal eigenvalue was set to 0.15. The tenth PC explained
only 0.188 % of the variance (negligible information content),
corresponding to an eigenvalue of 0.034, which
is less than the chosen limit, while the ninth PC
explained 1.086 %, corresponding to an eigenvalue of
0.195, greater than the chosen limit. This means that the
variance remaining after nine PCs represents
only noise in the primary data matrix, and the most
significant information content is contained in the first nine
PCs.
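The percentages in Table 3 and the effect of the eigenvalue limit can be checked with a few lines of Python. The eigenvalues are those listed in Table 3 and the limit of 0.15 is the one stated above; the function names are hypothetical.

```python
# Eigenvalues of the 18 principal components, as listed in Table 3.
eigenvalues = [
    8.081, 4.409, 1.567, 1.394, 0.720, 0.632, 0.521, 0.430, 0.195,
    0.034, 0.012, 0.003, 0.001, 0.001, 0.000, 0.000, 0.000, 0.000,
]


def explained_variance(eigs):
    """Return (percent, cumulative percent) of variance for each PC."""
    total = sum(eigs)
    percent = [100.0 * e / total for e in eigs]
    cumulative, running = [], 0.0
    for p in percent:
        running += p
        cumulative.append(running)
    return percent, cumulative


def n_retained(eigs, min_eigenvalue):
    """Number of PCs whose eigenvalue exceeds the chosen limit."""
    return sum(1 for e in eigs if e > min_eigenvalue)


percent, cumulative = explained_variance(eigenvalues)
n_pcs = n_retained(eigenvalues, min_eigenvalue=0.15)
```

Because the descriptors were autoscaled, the eigenvalues sum to the number of descriptors (18), so each percentage is simply the eigenvalue divided by 18; small differences from the tabulated values come from rounding of the eigenvalues.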
After the molecules were divided into calibration (training)
and validation (test) sets by the KS algorithm, using the PCs
as variables, the regression model was built on the training
set.
In the model formation step, an RBFNN model was built
to relate the PCs to the pIC50. This
model is called PCA-RBF-ANN henceforth.
The inputs of the network were the eigenvalue-ranked
PCs. The number of PCs entering the neural network was
varied from 1 to 18, and 9 of these PCs were selected as
network inputs. This number of PCs gave the best results
on the basis of the lowest root mean square error for the
training set (RMSET) and root mean square error of
cross-validation (RMSECV) in the output of the network (Fig. 3).
For the PCA-RBF-ANN model, the “spread” and the
number of RBFs (the hidden layer units) are the two
important parameters influencing the performance of the
network. The optimal width
value for the RBF-ANN was selected by systematically
changing its value in the training step. The values that
gave the best RMSE in LOO cross-validation were used
in the model. Each minimum error on LOO cross-validation
was plotted against the width (Fig. 4), and the minimum
was chosen as the optimal condition. Finally, the
number of hidden layer units was 18, and the optimal
spread was 1.7.
Based on the above optimization, a 9-18-1 ANN-RBF
model was finally constructed. The predicted data of the
RBF-ANN model are shown in Table 4, and the plot of the
predicted and experimental values of both training and test
sets is shown in Fig. 5.
Fig. 3 Optimization of the number of PCs used in the ANN model
Fig. 4 Optimization of the number of nodes in the hidden layer
and the spread value for the radial basis function neural network
using RMSECV
Developed QSAR models can be validated by different
theoretical tools. These tools can
generally be divided into two categories: internal validation
tools (which do not use any external source of data to
verify the predictability of the generated models) and
external validation tools (which employ, for validation, a
new set of data not used in model fitting).
One of the most important internal validation
methods is leave-one-out cross-validation
(LOO-CV).
The statistical parameters obtained by generated model
for the training and prediction sets are shown in Table 5.
The GA-ANFIS has the RMSE of 0.268 for the training set
and 0.362 for the test set. The squared correlation coeffi-
cient (R2) of the training set is 0.916, and of the test set is
0.932.
The usefulness of QSAR models lies in their capability not
only to reproduce known data, but also to give a good
estimate of bioactivity for any external object (Gramatica
and Papa, 2003). The predictive ability of a PCA–RBF–ANN
model can be severely affected by over-fitting. In QSAR
modeling, over-fitting arises when uninformative
independent variables enter the model. There are several
techniques to assess the quality of the developed models
(Golbraikh and Tropsha, 2002; Gramatica et al., 2007).
Cross-validation is the most commonly employed validation
technique (Zhang and
Fig. 5 The plot of the predicted and experimental values of both
training and test sets
Table 4 Experimental and predicted activities for studied molecules
Molecule No. Y Ypred REP
1 6.80 6.91 -1.62E-02
2 6.89 6.89 -9.63E-06
3 7.05 7.04 5.40E-07
4 6.82 6.82 -5.24E-05
5 7.10 7.10 1.13E-04
6 7.00 7.00 2.35E-04
7 7.00 7.00 1.63E-04
8 7.05 7.05 -3.15E-05
9 6.55 6.55 -3.36E-04
10 6.52 6.52 1.58E-05
11 7.15 7.15 -6.64E-06
12 7.10 7.10 4.31E-06
13 7.00 7.00 6.59E-06
14 6.68 6.68 -2.42E-06
15 6.52 6.52 2.26E-06
16 6.96 6.96 -6.90E-06
17 6.72 7.10 -5.35E-02
18 6.47 6.47 -3.20E-07
19 6.92 6.65 4.11E-02
20 7.40 7.40 -4.51E-06
21 5.66 5.66 -1.53E-06
22 7.00 7.00 4.25E-06
23 7.70 7.70 3.28E-05
24 7.30 7.30 2.42E-05
25 7.64 7.12 7.34E-02
26 6.85 6.85 1.14E-05
27 5.48 6.17 -1.12E-01
28 6.23 6.23 6.53E-05
29 6.82 6.83 -3.14E-04
30 6.66 6.66 1.10E-05
31 6.60 6.60 3.12E-04
32 5.62 5.62 5.74E-06
33 6.28 6.28 2.18E-05
34 6.07 6.06 3.20E-06
35 6.03 6.03 4.56E-05
Y experimental activity
Ypred calculated activity using model
REP relative error of prediction
Tropsha, 2000). Therefore, to examine the predictability
and over-fitting of the generated model, the LOO-CV
procedure was used. The squared correlation coefficient for
cross-validation (Q2) was then estimated by the following
equation: $Q^2 = 1 - \mathrm{PRESS}/\mathrm{SSD}$, where
PRESS and SSD are the predicted residual sum of squares and
the sum of the squared deviations from the mean,
respectively. The result of the LOO-CV procedure for the
training subset of the investigated compounds is reported
in Table 5. The cross-validation results show that the
generated model has a Q2 value of 0.732; therefore, the
developed model is a reasonable QSAR model.
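The cross-validated coefficient can be computed from the
leave-one-out predictions as sketched below, using the standard ratio
form Q2 = 1 - PRESS/SSD. This is a minimal sketch: the function name
q2_loo is hypothetical, and taking the mean over all observed
activities (rather than over each training fold) is an assumption.

```python
def q2_loo(y_obs, y_cv):
    """Cross-validated Q2 = 1 - PRESS / SSD.

    y_obs: experimental activities.
    y_cv:  leave-one-out predictions, each made by a model fit without
           the corresponding molecule.
    """
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    # PRESS: squared residuals of the cross-validated predictions.
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_cv))
    # SSD: squared deviations of the observations from their mean.
    ssd = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - press / ssd

if __name__ == "__main__":
    y = [6.0, 6.5, 7.0, 7.5]
    print(q2_loo(y, y))            # perfect predictions
    print(q2_loo(y, [6.75] * 4))   # predicting the mean everywhere
```

Perfect cross-validated predictions give Q2 = 1, while a model that
does no better than predicting the mean gives Q2 = 0; negative values
are possible when the predictions are worse than the mean.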
Conclusion
In this study, DFT-derived quantum-chemical descriptors
such as local charges and electrostatic potentials of atoms,
dipole moments, and HOMO and LUMO energies were combined
with principal component analysis and RBFNN to build a
quantitative structure–activity relationship for the
prediction of the glucagon receptor inhibitory activity of
a series of 35 imidazole derivatives.
A proper model with high statistical quality and low
prediction errors was obtained. The predictive power of the
developed model was further demonstrated using an
external validation method, in which the 35 studied
compounds were divided into training and test sets. This
validation method also confirmed the predictive power of
the generated RBFNN.
The results of the nonlinear model indicate a nonlinear
relationship between the principal components obtained
from quantum-electronic molecular descriptors and the
glucagon receptor-inhibitory activity for the studied set of
molecules.
With the rapid advancement of computer hardware, software,
and DFT methods, electronic structure theory has become an
efficient and routine tool for explaining the structural
features of an extensive range of molecules. Therefore, in
the present study, we endeavored to extend the application
field of traditional QSAR analysis by combining electronic
structure theory calculations with RBFNN.
Acknowledgments The authors are grateful to the Vice Chancellor
for Research and Technology, Kermanshah University of Medical
Sciences for the financial support. This article resulted from the
Pharm. D thesis of Zohreh Nazari, Major of Pharmacy, Kermanshah
University of Medical Sciences, Kermanshah, Iran.
References
Arkan E, Shahlaei M, Pourhossein A, Fakhri K, Fassihi A (2010)
Validated QSAR analysis of some diaryl substituted pyrazoles as
CCR2 inhibitors by various linear and nonlinear multivariate
chemometrics methods. Eur J Med Chem 45:3394–3406
Becke AD (1993) Density-functional thermochemistry. III. The role
of exact exchange. J Chem Phys 98:5648
Brand C, Rolin B, Jørgensen P, Svendsen I, Kristensen J, Holst J
(1994) Immunoneutralization of endogenous glucagon with
monoclonal glucagon antibody normalizes hyperglycaemia in
moderately streptozotocin-diabetic rats. Diabetologia
37:985–993
Breneman CM, Rhem M (1997) QSPR analysis of HPLC column
capacity factors for a set of high-energy materials using
electronic van der Waals surface property descriptors computed
by transferable atom equivalent method. J Comput Chem
18:182–197
Burcelin R, Katz E, Charron M (1996) Molecular and cellular aspects
of the glucagon receptor: role in diabetes and metabolism.
Diabetes Metab 22:373–396
Buyukbingol E, Sisman A, Akyildiz M, Alparslan FN, Adejare A
(2007) Adaptive neuro-fuzzy inference system (ANFIS): a new
approach to predictive modeling in QSAR applications: a study
of neuro-fuzzy modeling of PCP-based NMDA receptor
antagonists. Bioorg Med Chem 15:4265–4282
Cartier A, Rivail J-L (1987) Electronic descriptors in quantitative
structure–activity relationships. Chemom Intell Lab Syst
1(4):335–347
Chang LL, Sidler KL, Cascieri MA, de Laszlo S, Koch G, Li B,
MacCoss M, Mantlo N, O’Keefe S, Pang M (2001) Substituted
imidazoles as glucagon receptor antagonists. Bioorg Med Chem
Lett 11:2549–2553
Clare BW (1995) Structure-activity correlations for
psychotomimetics. III. Tryptamines. Aust J Chem 48:1385–1400
Clare BW, Supuran CT (1994) Carbonic anhydrase activators. 3:
structure-activity correlations for a series of isozyme II
activators. J Pharm Sci 83:768–773
Clare BW, Supuran CT (1998) Semi-empirical atomic charges and
dipole moments in hypervalent sulfonamide molecules:
descriptors in QSAR studies. J Mol Struct Theochem 428:109–121
Fassihi A, Shahlaei M, Moeinifard B, Sabet R (2012) QSAR study of
anthranilic acid sulfonamides as methionine aminopeptidase-2
inhibitors. Monatshefte für Chemie - Chemical Monthly
143:189–198
Table 5 Statistical parameters obtained for the ANN model
Parameter RBF-ANN
Dataset Training set Test set
N 28 7
R2 0.869 0.994
RMSE 0.186 0.042
PRESS 0.970 0.012
Q2 0.732 –
RMSECV 0.346 –
PRESSCV 2.157 –
N number of molecules in dataset, R2 squared correlation coefficient
of experimental and predicted activities, RMSE root mean square error,
PRESS predicted residual sum of squares, Q2 squared correlation
coefficient of leave-one-out cross-validation, RMSECV root mean square
error of cross-validation, PRESSCV predicted residual sum of squares of
cross-validation
Frisch M, Trucks G, Schlegel H, Scuseria G, Robb M, Cheeseman J,
Montgomery J, Vreven T, Kudin K, Burant J (2008) Gaussian 03,
revision C.02
Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graph Model
20(4):269–276
Gramatica P, Papa E (2003) QSAR modeling of bioconcentration
factor by theoretical molecular descriptors. QSAR Comb Sci
22:374–385
Gramatica P, Giani E, Papa E (2007) Statistical external validation
and consensus modeling: a QSPR case study for Koc prediction.
J Mol Graph Model 25:755–766
Hill DL (1972) The biochemistry and physiology of Tetrahymena, vol
230. Academic Press, New York
Hohenberg P, Kohn W (1964) Inhomogeneous electron gas. Phys Rev
136:B864
Johnson MEM, Das NM, Butcher FR, Fain JN (1972) The regulation
of gluconeogenesis in isolated rat liver cells by glucagon,
insulin, dibutyryl cyclic adenosine monophosphate, and fatty
acids. J Biol Chem 247:3229–3235
Kennard R, Stone L (1969) Computer aided design of experiments.
Technometrics 11:137–148
Khadikar PV, Phadnis A, Shrivastava A (2002) QSAR study on
toxicity to aqueous organisms using the PI index. Bioorg Med
Chem 10:1181–1188
Klopman G, Iroff LD (2004) Calculation of partition coefficients by
the charge density method. J Comput Chem 2:157–160
Kohn W, Sham LJ (1965) Self-consistent equations including
exchange and correlation effects. Phys Rev 140:A1133–A1138
Lee C, Yang W, Parr RG (1988) Development of the Colle–Salvetti
correlation-energy formula into a functional of the electron
density. Phys Rev B 37:785–789
Lewis D, Lake B, Ioannides C, Parke D (1994) Inhibition of rat
hepatic aryl hydrocarbon hydroxylase activity by a series of
7-hydroxy coumarins: QSAR studies. Xenobiotica 24:829–838
Mulliken R (1955a) Electronic population analysis on LCAO–MO
molecular wave functions. III. Effects of hybridization on
overlap and gross AO populations. J Chem Phys 23:2338–2342
Mulliken R (1955b) Electronic population analysis on LCAO–MO
molecular wave functions. IV. Bonding and antibonding in
LCAO and valence-bond theories. J Chem Phys 23:2343
Mulliken R (1955c) Electronic population analysis on LCAO–MO
molecular wave functions. II. Overlap populations,
bond orders, and covalent bond energies. J Chem Phys 23:1841
Mulliken RS (1955d) Electronic population analysis on LCAO–MO
molecular wave functions. I. J Chem Phys 23:1833–1840
Ordorica M, Velazquez M, Ordorica J, Escobar J, Lehmann P (1993)
A principal component and cluster significance analysis of the
antiparasitic potency of praziquantel and some analogues. Quant
Struct-Act Relat 12:246–250
Rajer-Kanduc K, Zupan J, Majcen N (2003) Separation of data on the
training and test set for modelling: a case study for modelling of
five colour properties of a white pigment. Chemom Intell Lab
Syst 65:221–229
Saghaie MS, A Fassihi L (2012) Quantitative structure activities
relationships of some 2-mercaptoimidazoles as CCR2 inhibitors
using genetic algorithm-artificial neural networks. Res Pharm
Sci 8:97–112
Saghaie L, Shahlaei M, Madadkar-Sobhani A, Fassihi A (2010)
Application of partial least squares and radial basis function
neural networks in multivariate imaging analysis-quantitative
structure activity relationship: study of cyclin dependent kinase 4
inhibitors. J Mol Graph Model 29:518–528
Seierstad M, Agrafiotis DK (2006) A QSAR model of hERG binding
using a large, diverse, and internally consistent training set.
Chem Biol Drug Des 67:284–296
Shahlaei M (2013) Descriptor selection methods in quantitative
structure–activity relationship studies: a review study. Chem Rev
113(10):8093–8103
Shahlaei M, Fassihi A (2012) QSAR analysis of some
1-(3,3-diphenylpropyl)-piperidinyl amides and ureas as CCR5
inhibitors using genetic algorithm-least square support vector
machine. Med Chem Res 22:4384–4400
Shahlaei M, Fassihi A, Saghaie L (2010a) Application of PC-ANN
and PC-LS-SVM in QSAR of CCR1 antagonist compounds: a
comparative study. Eur J Med Chem 45:1572–1582
Shahlaei M, Sabet R, Ziari MB, Moeinifard B, Fassihi A, Karbakhsh
R (2010b) QSAR study of anthranilic acid sulfonamides as
inhibitors of methionine aminopeptidase-2 using LS-SVM and
GRNN based on principal components. Eur J Med Chem
45:4499–4508
Shahlaei M, Madadkar-Sobhani A, Fassihi A, Saghaie L, Arkan E
(2012a) QSAR study of some CCR5 antagonists as anti-HIV
agents using radial basis function neural network and general
regression neural network on the basis of principal components.
Med Chem Res 21:3246–3262
Shahlaei M, Madadkar-Sobhani A, Fassihi A, Saghaie L, Shamshirian
D, Sakhi H (2012b) Comparative quantitative structure–activity
relationship study of some 1-aminocyclopentyl-3-carboxyamides
as CCR2 inhibitors using stepwise MLR, FA-MLR, and GA-
PLS. Med Chem Res 21(1):100–115
Shahlaei M, Fassihi A, Saghaie L, Arkan E, Madadkar-Sobhani A,
Pourhossein A (2013) Computational evaluation of some
indenopyrazole derivatives as anticancer compounds; application of
QSAR and docking methodologies. J Enz Inhib Med Chem 28:16–32
Sotomatsu T, Murata Y, Fujita T (1989) Correlation analysis of
substituent effects on the acidity of benzoic acids by the AM1
method. J Comput Chem 10:94–98
Tetko I, Luik A, Poda G (1993) Applications of neural networks in
structure-activity relationships of a small number of molecules.
J Med Chem 36:811–814
Trivedi D, Lin Y, Ahn J-M, Siegel M, Mollova NN, Schram KH,
Hruby VJ (2000) Design and synthesis of conformationally
constrained glucagon analogues. J Med Chem 43:1714–1722
Tropsha A, Gramatica P, Gombar V (2003) The importance of being
earnest: validation is the absolute essential for successful
application and interpretation of QSPR models. QSAR Comb
Sci 22:69–77
Tuppurainen K, Lotjonen S, Laatikainen R, Vartiainen T, Maran U,
Strandberg M, Tamm T (1991) About the mutagenicity of
chlorine-substituted furanones and halopropenals. A QSAR study
using molecular orbital indices. Mutat Res 247:97–102
Unger R, Orci L (1975) The essential role of glucagon in the
pathogenesis of diabetes mellitus. Lancet 305:14–16
Valkova I, Vracko M, Basak SC (2004) Modeling of structure–
mutagenicity relationships: counter propagation neural network
approach using calculated structural descriptors. Anal Chim Acta
509:179–186
Verma RP, Kurup A, Hansch C (2005) On the role of polarizability in
QSAR. Bioorg Med Chem 13:237–255
Wu W, Walczak B, Massart D, Heuerding S, Erni F, Last I, Prebble K
(1996) Artificial neural networks in classification of NIR spectral
data: design of the training set. Chemom Intell Lab Syst
33:35–46
Xiang Y, Liu M, Zhang X, Zhang R, Hu Z, Fan B, Doucet J, Panaye A
(2002) Quantitative prediction of liquid chromatography
retention of N-benzylideneanilines based on quantum chemical
parameters and radial basis function neural network. J Chem
Inf Comput Sci 42:592–597
Zhang W, Tropsha A (2000) Novel variable selection quantitative
structure–property relationship approach based on the
k-nearest-neighbor principle. J Chem Inf Comput Sci 40:185–194