prediction of glucagon receptor antagonist activities of some substituted imidazoles using combined...

ORIGINAL RESEARCH

Prediction of glucagon receptor antagonist activities of some substituted imidazoles using combined radial basis function neural network and density functional theory

Mohsen Shahlaei • Zohreh Nazari

Received: 29 July 2013 / Accepted: 22 October 2013

© Springer Science+Business Media New York 2013

Abstract A QSAR study of human glucagon receptor (HGR) ligands was carried out using quantum descriptors, such as the HOMO energy, LUMO energy, softness, and hardness, in combination with principal component analysis and a radial basis function artificial neural network (ANN). The quantum descriptors were calculated with the DFT-B3LYP method and the 6-311G basis set. The developed neural network QSAR model outperformed the principal component regression model in both fitting and predictive abilities. ANN analysis indicated that the estimated activities were in good agreement with the experimentally observed values (R2 = 0.869, RMSD = 0.186; predictive Q2 = 0.732, RMSEcv = 0.346). The developed models were further examined by means of an external prediction set. The modeling study also reflected the important role of the quantum properties of molecules when they interact with the target, HGR. The developed neural network model is expected to be useful in the rational design of new chemical entities as HGR ligands and in directing the synthesis of potent molecules in the future.

Keywords Glucagon receptor inhibition activity · QSAR · Radial basis function neural network · Density functional theory

Introduction

Glucagon is a 29-amino acid peptide hormone secreted by the α-cells of the pancreas and an important counter-regulatory hormone in the control of glucose homeostasis (Burcelin et al., 1996). Glucagon secreted from the pancreatic α-cells helps maintain glucose homeostasis by stimulating gluconeogenesis and glycogenolysis in hepatocytes and lipolysis in adipocytes during the hypoglycemic state (Johnson et al., 1972). Under normal conditions, glucagon is synthesized and secreted in response to insufficient blood glucose levels.

In healthy individuals, glucagon binds to specific receptors in the liver called the human glucagon receptor (HGR). Binding of glucagon to HGR triggers the synthesis of glucose (gluconeogenesis) and the processing and release of hepatic glycogen stores (glycogenolysis) to restore blood glucose and maintain homeostasis.

HGR is a G protein-coupled receptor; binding of glucagon to it stimulates cyclic AMP and Ca2+ accumulation as a result of adenylate cyclase activation (Trivedi et al., 2000). In type II diabetes, the bihormonal hypothesis implicates inappropriate secretion and activity of the two major pancreatic hormones that control glucose homeostasis, insulin and glucagon (Unger and Orci, 1975).

In vivo research studies in various animal species imply

that neutralization of circulating glucagon alleviates

hyperglycemia (Brand et al., 1994). Hence, antagonists of

the HGR have the potential to modulate the rate of hepatic

M. Shahlaei

Novel Drug Delivery Research Center, School of Pharmacy,

Kermanshah University of Medical Sciences, Kermanshah, Iran

Z. Nazari

Student Research Committee, Kermanshah University of

Medical Sciences, Kermanshah, Iran

M. Shahlaei (✉)

Department of Medicinal Chemistry, Faculty of Pharmacy,

Kermanshah University of Medical Sciences, Kermanshah, Iran

e-mail: [email protected]; [email protected]


Med Chem Res

DOI 10.1007/s00044-013-0869-9

MEDICINAL CHEMISTRY RESEARCH

glucose output and improve insulin responsiveness in the

liver, resulting in a decrease in fasting plasma glucose

levels in diabetics (Chang et al., 2001).

Several experimental procedures are available for screening compounds for biological activity (e.g., in vivo and in vitro assays). All of these procedures, however, rely on receptors and other biological materials of human, rat, mouse, or calf origin (Hill, 1972). They are expensive and time-consuming, and the experimental techniques currently employed can generate toxic by-products. Consequently, the development of computational procedures as alternative tools for predicting the properties of compounds has become a subject of intensive study.

Among computational procedures for drug design and

discovery, quantitative structure–activity relationship

(QSAR) has found various applications for predicting

chemical properties, including biological activity (Seiers-

tad and Agrafiotis, 2006), physical properties (Verma et al.,

2005), and toxicity (Khadikar et al., 2002).

QSAR models are common and rather successful

methods in drug design and computational discovery of the

new lead compounds.

QSAR models are essentially regression models in which the independent variables are molecular descriptors that describe the structure of molecules quantitatively, and the dependent variable is the activity of interest, usually expressed as pIC50 (−log IC50) in the case of antagonists.

However, in cases of complex relationships, conventional

QSAR models (such as multiple linear regression) often

lead to insufficient or misleading information because of

nonlinear relationships within the studied dataset.

Moreover, for membrane-bound receptors (such as

HGR), the activity often results from both membrane

interaction and receptor binding, which may also lead to

nonlinear dependencies (Buyukbingol et al., 2007). One

possibility of overcoming the difficulties of such nonlin-

earities in QSAR research studies is the use of artificial

neural network (ANN) models, which has gained much

interest in the field of computational drug design (Valkova

et al., 2004; Arkan et al., 2010; Saghaie et al., 2010;

Shahlaei et al., 2010a, b, 2012a, 2013; Shahlaei and Fassihi, 2012; Shahlaei, 2013). After a suitable learning step, an ANN model should be able to “recognize” basic correlations in a given dataset and to predict, for example, pIC50 (Saghaie and Fassihi, 2012; Fassihi et al., 2012; Shahlaei et al., 2012b). In this study, a principal component analysis-based neural network approach is applied to build a QSAR model for a set of 35 substituted imidazoles with known affinity for HGR. The goals of the present study are to extract the relevant principal components from quantum descriptors, to establish the QSAR of this library of compounds, to demonstrate the predictive ability of radial basis function neural network (RBFNN) modeling on this library of ligands, and to develop insights into the relationship between the principal components extracted from the quantum descriptors of the compounds of interest and their affinity for HGR. The generated nonlinear ANN model is expected to be valuable in the rational design of chemical modifications of HGR antagonists and in identifying the most likely candidates for the synthesis and discovery of new lead ligands.

Methods

Calculation of quantum descriptors

The biological data used in this study were the glucagon receptor inhibitory activities (expressed as −log IC50) of a set of 35 compounds taken from the study by Chang et al. (2001). The basic skeletons of the studied compounds and details are summarized in Table 1. The initial structures of all substituted imidazole derivatives were constructed using the software CS Chem3D (Ultra 10.0, ChemOffice 2006, CambridgeSoft Corporation).

To save computational time, initial geometry optimizations were carried out by the molecular mechanics (MM) method using the MM+ force field. The resulting geometries of all 35 studied ligands were reoptimized with the density functional theory (DFT) method at the B3LYP/6-31G level (Becke, 1993), and frequency calculations were performed at the same level for all of the possible geometries to make sure that they are minima on the potential energy surface. DFT is a quantum mechanical modeling technique based on the Hohenberg–Kohn theorem (Hohenberg and Kohn, 1964) and the Kohn–Sham method (Kohn and Sham, 1965) and is used to calculate the ground-state electronic energy E0 and other ground-state molecular properties from the ground-state electronic density ρ0 instead of the electronic wave function. Since

the mid 1990s, the B3LYP level of theory (Lee et al., 1988;

Becke, 1993) has been the most extensively applied for

quantum calculations in molecules because of the accuracy

of the results obtained for a vast range of molecules, par-

ticularly organic molecules. All B3LYP level of theory

calculations were carried out by Gaussian 03 program

(Frisch et al., 2008).

As listed in Table 2, 18 quantum-chemical descriptors were calculated to assess how their variation relates to, and how efficiently they predict, the inhibition activity of the compounds of interest. The quantum descriptors employed in this study, such as polarizability (α), dipole moment (μ), energy of the highest occupied molecular orbital (EHOMO), energy of the lowest unoccupied molecular orbital (ELUMO), the most negative atomic charge, the most positive charge, etc., have all been


Table 1 The main skeletons and details of structures used in this study

Skeleton for compounds 1-32: imidazole ring (N, NH) carrying substituents R1, R2, and R3. The substituents were drawn as structures; only their atom labels are listed, in the order R3, R2, R1.

Compd  pIC50     Atom labels (R3, R2, R1)
1      6.7958    N, F, Br
2      6.886     N, F
3      7.045     N, F, Br
4      6.823     N, F, Br
5      7.096     N, F, Cl
6      7.000     N, F, F
7      7.000     N, F, I
8      7.045     N, F, CH3
9      6.552     N, F, CH3, CH3
10     6.522     N, F
11     7.154     N, F, NH2
12     7.096     N, F, OH
13     7.000     N, F, OCH3
14     6.677     N, F, CN
15     6.522     N, F, CO2CH3
16     6.958     N, F, S, Br
17     6.721     N, F, O, Br
18     6.468     N, F
19     6.920     N, F, H2CH2C
20     7.397     N, Br
21     5.657577  Br
22     7.000     OH, Br
23     7.698     N, H3C, F, Cl
24     7.301     N, CH3, F, Cl
25     7.638     N, Cl, Cl
26     6.853     N, I, Cl
27     5.481     N, Cl
28     6.229     N, Cl
29     6.823     N, O, Cl
30     6.657     N, O, Cl
31     6.602     N, O, Cl
32     5.619     N, O CH2CH2CH3, O CH2CH2CH3, Cl

Skeleton for compounds 33-35: core labeled N, SCH3, and F, carrying a heterocyclic substituent (Het).

Compd  pIC50  Atom labels (Het)
33     6.283  N, N, Me
34     6.065  N, NH, Me
35     6.031  N, O

obtained directly or indirectly (Table 2) from the Gaussian

output files.

The selection of input variables for an ANN is essential to avoid “overfitting” (Tetko et al., 1993) when many input descriptors are available. As a linear technique for dimensionality reduction, PCA can transform the input dataset from its original form (points in m-dimensional space) to a new form (points in p-dimensional space), where p is less than m. During this process, most of the variability of the original input dataset is retained. With the input dataset represented in a lower dimension, a smaller ANN can be used for prediction.

Principal component analysis (PCA)

Next, a PCA was performed for variable reduction and data interpretation. In PCA, descriptors that describe the same property cluster together, and hence, the predicted activity can be described with fewer independent variables.

In the PCA, data preprocessing must first be carried out on the calculated descriptors using mean centering and autoscaling. If k denotes the number of molecules used in the regression, p the number of descriptors calculated for each molecule, and y the activity (pIC50), then y is a vector of length k, and X is a (k × p) matrix containing the calculated descriptors as row vectors.

$$X = CR \tag{1}$$

The matrix C and the eigenvalues [λ] are given by solving the eigenvalue problem:

$$C^{T} Z C = [\lambda] \tag{2}$$

and the eigenvector matrix R (k × p) is calculated by

$$R = C^{T} X. \tag{3}$$

In Eq. (2), $Z = XX^{T}$ denotes the (k × k) covariance matrix, and [λ] is the diagonal matrix of eigenvalues. The rows in $C^{T}$ are the eigenvectors of Z, and its columns are the “scores.” The column vectors of the square matrix C are orthonormal and often called principal components (PCs).

After generation of principal components, these scores

were used as new variables for regression.

In general, only p principal components are enough to account for most of the variance in an m-dimensional dataset, where p is the number of important principal components of the dataset and m is the number of all the principal components in the dataset. Since p is less than m, PCA is generally regarded as a data reduction technique. That is to say, PCA can project a multidimensional dataset onto a lower-dimensional data space without losing most of the information of the original dataset.
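The PCA step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' MATLAB code: the function name `pca_scores` and the random placeholder matrix are hypothetical, and for convenience the sketch diagonalizes the p × p correlation matrix of the autoscaled descriptors instead of the k × k matrix Z (both share the same nonzero eigenvalues up to a constant factor).

```python
import numpy as np

def pca_scores(X):
    """Autoscale X (k molecules x p descriptors); return eigenvalues, loadings, scores."""
    X = np.asarray(X, dtype=float)
    # Autoscaling: mean-center each descriptor and divide by its standard deviation.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the descriptors (p x p).
    corr = Xs.T @ Xs / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # eigh returns ascending eigenvalues; reorder to descending (PC1 first).
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scores: projection of the autoscaled data onto the eigenvectors.
    scores = Xs @ eigvecs
    return eigvals, eigvecs, scores

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(35, 18))  # placeholder data, NOT the paper's descriptors
eigvals, loadings, scores = pca_scores(X_demo)
explained_pct = 100.0 * eigvals / eigvals.sum()
```

Because the descriptors are autoscaled, the eigenvalues sum to the number of descriptors, so each eigenvalue divided by that number gives the fraction of variance explained by the corresponding PC.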

Splitting PCs matrix into training and test sets

At the next step of developing QSAR models, and to

develop a reliable (validated) QSAR model, consecutive

molecules are selected and put alternatively in the training

and test sets. The division of an original dataset into the

training and test sets can be carried out using various

algorithms.

Ideally, this division must be carried out so that points

representing both training and test sets are distributed

within the whole descriptor space defined by the original

dataset, and each molecule of the test set is close to at least

one molecule of the training set. This method ensures that

the similarity principle can be used for the pIC50 prediction

of the test set. The Kennard–Stone (KS) algorithm (Ken-

nard and Stone, 1969) is well known in the field of data

splitting and has already found many uses in QSAR studies

(Arkan et al., 2010; Saghaie et al., 2010; Shahlaei et al.,

2010a, b).

Table 2 Quantum electronic descriptors used in this study

Descriptor abbreviation  Descriptor definition
EHOMO  The energy of the highest occupied molecular orbital
ELUMO  The energy of the lowest unoccupied molecular orbital
H–L  The HOMO–LUMO energy gap
Electronegativity  $\chi = (E_{HOMO} - E_{LUMO})/2$
Hardness  $\eta = -(E_{HOMO} + E_{LUMO})/2$
Electrophilicity  $\omega = \chi^{2}/(2\eta)$
MPC  The most positive charge
LNC  The least negative charge
SSC  The sum of square of charges
SSPC  The sum of square of positive charges
SSNC  The sum of square of negative charges
SPC  The sum of positive charges
SNC  The sum of negative charges
SAC  The sum of absolute charge
DMx  Dipole moment in x direction
DMy  Dipole moment in y direction
DMz  Dipole moment in z direction
TDM  Total dipole moment

Random sampling is a well-known method because of its simplicity and also because a group of data randomly extracted from a larger set follows the statistical distribution of the entire set. However, random sampling neither guarantees the representativeness of the set nor prevents extrapolation problems (Rajer-Kanduc et al., 2003). In other words, random selection does not ensure that the molecules on the boundaries of the set are included in the training set. An alternative to random selection that is often used is the KS algorithm, which aims at spanning the multidimensional space in a uniform manner by maximizing the Euclidean distances between the descriptor vectors (x) of the selected molecules (Wu et al., 1996).

In order to ensure a uniform distribution of such a

subset along the x (descriptors) data space, KS follows a

stepwise method in which new selections are taken in

regions of the space far from the molecules already

selected. Therefore, the method uses the Euclidean dis-

tances dx(p,q) between the x-vectors of each pair (p,q) of

molecules computed as

$$d_x(p,q) = \sqrt{\sum_{j=1}^{J} \left[ x_p(j) - x_q(j) \right]^{2}}, \qquad p, q \in [1, N]$$

For a typical QSAR, xp(j) and xq(j) are the values of the jth descriptor for molecules p and q, respectively. J denotes the number of descriptors in the original dataset matrix. The selection procedure starts by taking the pair (p1,p2) of molecules for which the Euclidean distance dx(p1,p2) is the largest. In each subsequent iteration, the method selects the molecule that has the largest minimum Euclidean distance with respect to the molecules already selected. This procedure is repeated until the number of molecules specified by the user is reached.
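The selection procedure just described can be sketched as follows. This is a minimal illustration of the Kennard–Stone idea, assuming that the selected molecules form the training set and the rest form the test set; the function name `kennard_stone` and the one-dimensional toy data are hypothetical.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) row indices of X using Kennard-Stone selection."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all molecules.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Start with the two most distant molecules.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [m for m in range(len(X)) if m not in selected]
    while len(selected) < n_select and remaining:
        # Distance from each remaining molecule to its nearest selected molecule.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        # Pick the molecule whose nearest selected neighbor is farthest away.
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Toy example: five "molecules" described by a single descriptor value.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [5.0]])
train_idx, test_idx = kennard_stone(X_toy, n_select=3)
```

In the toy example the two extremes (0 and 10) are chosen first, followed by the point midway between them, which illustrates how the boundary molecules always end up in the selected set.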

Radial basis function neural networks (RBFNNs)

In the present study, one type of neural networks, namely

RBFNN, was employed to establish an alternative nonlin-

ear model. The theory of RBFNN has been adequately

described in detail elsewhere (Xiang et al., 2002). Hence,

we will limit ourselves to a brief outline highlighting only

the most important aspects.

Usually, RBFNN comprises three layers: the input layer,

the hidden layer, and the output layer (Fig. 1). The input

layer does not process the information, since it only dis-

tributes the input vectors to the hidden layer, whereas the

latter consists of a number of RBF units (nh) and biases

(bk). Each neuron on the hidden layer employs a radial

basis function (RBF) as a nonlinear transformation function

to operate on the input data. The most frequently used RBF is a Gaussian function that is characterized by a center and a width. This function measures the Euclidean distance between the input vector (X) and the center (cj) and performs the nonlinear transformation within the hidden layer as follows:

$$H_j = \exp\left( -\frac{\lVert X - c_j \rVert^{2}}{r_j^{2}} \right) \tag{4}$$

where Hj denotes the output of the jth RBF unit, while cj and rj are the center and the width of this unit, respectively. The operation of the output layer is linear and is given by:

$$y_k(X) = \sum_{j=1}^{n_h} W_{kj} H_j + b_k \tag{5}$$

where yk is the kth output unit for the input vector X, Wkj is the weight connection between the kth output unit and the jth hidden layer unit, and bk is the respective bias (Fig. 1).

From Eqs. (4) and (5), one can see that the design of

RBFNN involves selecting centers, number of hidden layer

units, widths, and weights. There are various methods for

selecting the centers, such as random subset selection,

k-means clustering, and RBF–PLS. In this study, a forward

subset selection routine was used to select the centers from

the training set samples. As regards the widths of the radial

basis functions, they can either be chosen equal for all the

units or different for each unit. Here, we limited ourselves to

Gaussian functions with a constant width for all of the units.

Furthermore, the adjustment of the connection weight

between the hidden layer and output layer was performed

using a least-squares solution after the selection of the RBF

centers and width. All RBFNN calculations were performed

using home-developed scripts using the MATLAB package

(www.mathworks.com/products/matlab/).
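A radial basis function network of the kind outlined in this section (Gaussian hidden units with one common width and output weights obtained by least squares) can be sketched as below. This is not the authors' MATLAB script: the function names are hypothetical, the centers are simply passed in by the caller rather than chosen by forward subset selection, and the sine data are a placeholder for the PC scores and pIC50 values.

```python
import numpy as np

def rbf_design(X, centers, width):
    """Hidden-layer outputs for inputs X, plus a column of ones for the bias."""
    # Squared Euclidean distance between every input vector and every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / width ** 2)  # Gaussian RBF with a common width
    return np.hstack([H, np.ones((X.shape[0], 1))])

def rbf_fit(X, y, centers, width):
    """Least-squares solution for the output weights (last entry is the bias)."""
    weights, *_ = np.linalg.lstsq(rbf_design(X, centers, width), y, rcond=None)
    return weights

def rbf_predict(X, centers, width, weights):
    """Linear output layer applied to the hidden-layer outputs."""
    return rbf_design(X, centers, width) @ weights

# Placeholder data: fit a sine curve with 10 Gaussian units.
X_train = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y_train = np.sin(X_train[:, 0])
centers = X_train[::2]  # every second training point used as a center
weights = rbf_fit(X_train, y_train, centers, width=1.0)
y_fit = rbf_predict(X_train, centers, 1.0, weights)
train_rmse = float(np.sqrt(np.mean((y_fit - y_train) ** 2)))
```

Once the centers and the width are fixed, the model is linear in the weights, which is why a single least-squares solve is enough and no iterative training is needed.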

Moreover, the overall performance of the final RBFNN

model was evaluated in terms of its root mean squared

error (RMS), and its goodness and robustness estimated by

the same statistical parameters as those used for the linear

model Fig. 2.

Fig. 1 A typical radial basis function network architecture with n and j neurons in the input (X) and hidden (H) layers, respectively, and a single neuron in the output layer (Y). An activation function (drawn inside the circles) depicts what happens at a given neuron. The Wi values represent the weights


Validation and evaluation

Testing the stability, predictive power, and generalization

ability of the models is a very important step in QSAR

study. There are several tools to estimate the accuracy and validity of the proposed QSAR model, as well as the impacts of the preprocessing steps; these tools can be categorized into two sets: internal validation and external validation.

Some of the common parameters used for checking the

predictability of proposed models are root mean square

error (RMSE), square of the correlation coefficient (R2),

and a predictive residual error sum of squares (PRESS).

These parameters were calculated for each model as

follows:

$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^{2} \right]^{1/2} \tag{6}$$

$$R^{2} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}} \tag{7}$$

$$\mathrm{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2} \tag{8}$$

where $y_i$ is the true bioactivity of the investigated compound i, $\hat{y}_i$ represents the calculated bioactivity of compound i, $\bar{y}$ is the mean of the true activity in the studied set, and n is the total number of molecules used in the studied sets.
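The three statistics can be computed directly from paired observed and calculated activities. The sketch below is illustrative only: the function names are hypothetical, R2 is implemented as the ratio of the explained to the total sum of squares, and the four value pairs are the rounded entries for molecules 1, 17, 19, and 25 of Table 4, used merely as sample input.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observed and calculated activities."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def press(y_true, y_pred):
    """Predictive residual error sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sum((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Explained sum of squares divided by total sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    y_mean = y_true.mean()
    return float(np.sum((y_pred - y_mean) ** 2) / np.sum((y_true - y_mean) ** 2))

# Sample input: rounded Y and Ypred of molecules 1, 17, 19, and 25 (Table 4).
y_obs = [6.80, 6.72, 6.92, 7.64]
y_calc = [6.91, 7.10, 6.65, 7.12]
sample_rmse = rmse(y_obs, y_calc)
sample_press = press(y_obs, y_calc)
sample_r2 = r_squared(y_obs, y_calc)
```

RMSE and PRESS are tied together by RMSE = sqrt(PRESS / n), so reporting both is mainly a matter of convention.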

Cross-validation is one of the most popular methods for internal validation. In this study, the internal predictive capability of the model was evaluated by leave-one-out cross-validation (Q2LOO). A good Q2LOO often indicates good robustness and high internal predictive power of a QSAR model. However, the studies of Tropsha et al. (2003) indicate that there is no evident correlation between the value of Q2LOO and the actual predictive power of a QSAR model, revealing that Q2LOO alone is inadequate as a reliable estimate of a model's predictive ability for new compounds. To determine both the generalizability of QSAR models to new compounds and the true predictive ability of the models, statistical external validation can be used at the model development step by properly employing a prediction set for validation. The results of data splitting using the KS algorithm are shown in Table 1, where the test set is indicated with an asterisk.
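Leave-one-out cross-validation can be sketched generically, independent of the model type. In the illustration below, the names `loo_cross_validation`, `fit_linear`, and `predict_linear` are hypothetical, a plain linear least-squares model stands in for the RBFNN, and Q2 is assumed to be 1 − PRESS/SS, with SS computed around the mean of all observed activities.

```python
import numpy as np

def loo_cross_validation(X, y, fit, predict):
    """Return (Q2_LOO, RMSECV) for any model given by fit/predict callables."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i            # leave molecule i out
        model = fit(X[keep], y[keep])       # refit on the remaining n - 1 molecules
        y_cv[i] = predict(model, X[i:i + 1])[0]
    press = float(np.sum((y - y_cv) ** 2))
    ss_total = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press / ss_total, float(np.sqrt(press / n))

def fit_linear(X, y):
    """Stand-in model: linear least squares with an intercept."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ coef

# Toy data lying exactly on a line, so every left-out point is predicted exactly.
X_demo = np.arange(6, dtype=float)[:, None]
y_demo = 2.0 * X_demo[:, 0] + 1.0
q2_loo, rmsecv = loo_cross_validation(X_demo, y_demo, fit_linear, predict_linear)
```

Passing the model as a pair of callables keeps the loop reusable, for example for scanning different numbers of PCs or different spread values with the same cross-validation code.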

Results and discussion

Many molecular properties depend on intermolecular

interactions. The main component of these interactions is

electrostatic in its nature. Electrical charges in the molecule

are simply the driving force of electrostatic interactions.

Charge-based descriptors have therefore widely been used

as chemical reactivity indices or as measures of weak

intermolecular interactions. The charge distributions in a

given molecule and partial charges on the atoms can be

estimated using quantum-chemical calculations. One of the

most important parts of standard output of almost any

quantum calculation is the Mulliken atomic charges

(Mulliken, 1955a, b, c, d). Usually, the minimum (most

negative) and maximum (most positive) atomic partial

charges in the molecule or the minimum or maximum

partial charges for particular types of atoms are employed

as quantum descriptors (Clare and Supuran, 1994; Cartier

and Rivail 1987). Different sums of absolute or square

values of partial charges (e.g., sum of positive charges and

sum of square of positive charges) have also been

employed to explain intermolecular interactions. Other

Fig. 2 Score plots on the first three principal components of PCA


usual charge-based quantum-chemical indices used as

electrostatic descriptors in QSAR models are the average

absolute atomic charge (Clare and Supuran, 1994; Ordorica

et al., 1993) and a polarity parameter defined as the dif-

ference between the values of the most positive and neg-

ative charges (Clare and Supuran, 1994; Cartier and Rivail,

1987; Clare, 1995).

Electrostatic interactions can also be described by the respective electrical moments and their components. The polarity is described by the dipole moment (μ). The polarization of a molecule by an external electric field can be defined in terms of nth order susceptibility tensors of the molecule (Sotomatsu et al., 1989). The first-order term, which is referred to as the polarizability of the molecule, represents the relative susceptibility of the electron cloud of an atom or a molecule to be distorted from its normal shape by the presence of an external field. Due to this distortion, an induced electric dipole moment appears. Polarizability (α) is a tensor relating the induced dipole moment (μind) to the applied electric field strength. The nondiagonal elements of the tensor represent the polarizability of the electrons along one of the axes of the coordinate system due to a component of the applied electric field along the other coordinate axes. As this effect is insignificant compared to the polarizability in the direction of the applied electric field, the nondiagonal elements of the polarizability tensor are zero or very small compared with the diagonal elements. The polarizability is therefore represented in practice as the “mean polarizability,” i.e., the average polarizability over the three axes of the molecule, which equals one-third of the trace of the tensor. It has been shown that α is related to the molecular volume (Lewis et al., 1994), hydrophobicity (Breneman and Rhem, 1997), and the electrophilic superdelocalizability (Clare and Supuran, 1998).

According to classical chemical theory, all chemical

interactions are by nature either electrostatic (polar) or

orbital (covalent) driven. In quantum chemistry, covalent

interactions arise from orbital overlap. The interaction of

two orbitals depends on their energy eigenvalues. Conse-

quently, energies associated with the highest occupied

molecular orbital (EHOMO) and the lowest unoccupied

molecular orbital (ELUMO) are often good candidates for

2D descriptors. For example, EHOMO might model the

covalent basicity of a hydrogen bond acceptor or the

ELUMO might model the covalent acidity of the proton of a

H bond donor. Further interpretation is possible because

the HOMO energy is related to the ionization potential and

is a measure of the molecule’s tendency to be attacked by

electrophiles. Correspondingly, the LUMO energy is rela-

ted to the electron affinity and is a measure of a molecule’s

tendency to be attacked by nucleophiles (Tuppurainen

et al., 1991). Furthermore, according to frontier molecular

orbital theory, transition state formation involves the

interaction between the frontier orbitals of reacting species.

The HOMO–LUMO gap, i.e., the difference between the

EHOMO and the ELUMO is an important stability index

(Lewis et al., 1994).

A large HOMO–LUMO gap implies high stability for

the molecule in terms of its lower reactivity in chemical

reactions. The concept of chemical hardness has been

derived from the basis of the HOMO–LUMO energy gap

(Klopman and Iroff, 2004).

Eighteen quantum descriptors were calculated for each

of the studied molecules. All the descriptors representing

the electrostatic potential, together with all frontier orbital

descriptors used in this study are listed in Table 2.
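Several of the charge-based and dipole descriptors in Table 2 are simple functions of the atomic partial charges and dipole components. The sketch below shows one way they might be computed; the function names, dictionary keys, and the list of charges are hypothetical, and no parsing of Gaussian output files is attempted.

```python
import math

def charge_descriptors(charges):
    """Charge-based descriptors from a list of atomic partial charges."""
    positive = [q for q in charges if q > 0]
    negative = [q for q in charges if q < 0]
    return {
        "MPC": max(charges),                    # most positive charge
        "most_negative": min(charges),          # most negative charge
        "SSC": sum(q * q for q in charges),     # sum of squares of all charges
        "SSPC": sum(q * q for q in positive),   # sum of squares of positive charges
        "SSNC": sum(q * q for q in negative),   # sum of squares of negative charges
        "SPC": sum(positive),                   # sum of positive charges
        "SNC": sum(negative),                   # sum of negative charges
        "SAC": sum(abs(q) for q in charges),    # sum of absolute charges
    }

def total_dipole(dm_x, dm_y, dm_z):
    """Total dipole moment from its Cartesian components."""
    return math.sqrt(dm_x ** 2 + dm_y ** 2 + dm_z ** 2)

# Made-up partial charges for a five-atom fragment (illustration only).
demo = charge_descriptors([-0.5, 0.25, 0.25, -0.25, 0.25])
demo_tdm = total_dipole(1.0, 2.0, 2.0)
```

For a neutral molecule the sums of positive and negative charges cancel, so SPC and SNC carry the same information and are expected to be strongly intercorrelated, which is one reason a PCA step before regression is useful.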

To obtain a linear relationship with the independent variables, the logarithm of the inverse of the biological activity (log 1/IC50) of the 35 molecules was used.
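The transformation between IC50 and pIC50 is a one-line calculation. The sketch below assumes that IC50 is expressed in mol/L before taking the logarithm and that the input is given in nM; both function names are hypothetical.

```python
import math

def pic50_from_ic50_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L), with IC50 supplied in nM (assumed unit)."""
    return -math.log10(ic50_nm * 1e-9)

def ic50_nm_from_pic50(pic50):
    """Inverse transformation: IC50 in nM from pIC50."""
    return 10.0 ** (-pic50) * 1e9

example_pic50 = pic50_from_ic50_nm(100.0)    # an IC50 of 100 nM
example_ic50 = ic50_nm_from_pic50(7.0)
```

On this scale a tenfold decrease in IC50 raises pIC50 by one unit, so more potent antagonists have larger pIC50 values.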

PCA is a multivariate technique in QSAR that analyzes a

data matrix in which molecules are described by several

intercorrelated quantitative-dependent descriptors. Its goal

is to extract the important information from the matrix to

represent it as a set of new orthogonal variables called

principal components, and to display the pattern of similarity

of the observations and of the variables as points in maps.

PCA was performed on the calculated quantum descriptors. All the calculated PCs with their eigenvalues are shown in Table 3. In this table, the eigenvalues, the percentage variance explained by each eigenvalue, and the cumulative percentage variance are given. Therefore, we limited the further studies to PCs and selection of

Table 3 Percentage variance explained in the PCA carried out on the original matrix of the quantum descriptors

PC no. Eigenvalue Variance (%) Cumulative variance (%)

1 8.081 44.893 44.893

2 4.409 24.494 69.386

3 1.567 8.706 78.092

4 1.394 7.745 85.838

5 0.720 4.001 89.839

6 0.632 3.509 93.348

7 0.521 2.893 96.241

8 0.430 2.390 98.631

9 0.195 1.086 99.717

10 0.034 0.188 99.905

11 0.012 0.066 99.971

12 0.003 0.016 99.987

13 0.001 0.007 99.994

14 0.001 0.005 99.998

15 0.000 0.002 100.000

16 0.000 0.000 100.000

17 0.000 0.000 100.000

18 0.000 0.000 100.000


the best subset of these PCs to perform linear and nonlinear

regression methods.

PCA, despite outstanding properties, is known to have

some inadequacies. One such inadequacy is that it is

strongly influenced by the presence of outliers. In QSAR

studies, the outliers are the molecules exhibiting very dif-

ferent values for some of the calculated descriptors in

comparison with the majority of molecules. Hence, the

obtained PCs will not explain the majority of the data well,

and one cannot get a suitable insight into the data structure.

A method to deal with this problem is to remove the outlying molecules detected on the score plots and to repeat the PCA procedure. The first three PCs accounted for 78.1 % of the total variance in PCA (Table 3), and the scores of all 35 molecules on these PCs are shown in Fig. 2, which can be inspected for obvious outliers.

Table 3 shows the eigenvalues and percentage variance for the principal components extracted from the original quantum data matrix. As seen in this table, the largest part of the information is captured by the first principal component (44.89 %). The second and third principal components comprise a lower amount of information (24.49 and 8.71 %, respectively), and all 18 components together sum to an overall information content of 100 %. The minimal eigenvalue was set to 0.15. The information content of the tenth PC was 0.188 % (negligible), corresponding to an eigenvalue of 0.034, which is less than the chosen limit, while that of the ninth PC was 1.086 %, corresponding to an eigenvalue of 0.195, which is greater than the chosen limit. This means that the variance remaining after the first nine PCs represents only noise in the primary data matrix, and the most significant information content is contained in the first nine PCs.
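The eigenvalue cut-off can be checked against the values printed in Table 3. The short sketch below uses those eigenvalues as input; the variable names are illustrative, and the list is assumed to be sorted in descending order, as it is in the table.

```python
# Eigenvalues of the 18 PCs as printed in Table 3.
eigenvalues = [8.081, 4.409, 1.567, 1.394, 0.720, 0.632, 0.521, 0.430, 0.195,
               0.034, 0.012, 0.003, 0.001, 0.001, 0.000, 0.000, 0.000, 0.000]

threshold = 0.15  # minimal eigenvalue stated in the text

# Number of PCs whose eigenvalue exceeds the threshold.
n_keep = sum(1 for ev in eigenvalues if ev > threshold)

# Percentage of the total variance carried by the retained PCs.
total = sum(eigenvalues)
cumulative_pct = 100.0 * sum(eigenvalues[:n_keep]) / total
```

The eigenvalues add up to 18, the number of autoscaled descriptors, so each eigenvalue divided by 18 reproduces the corresponding percentage in the variance column of the table.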

After the molecules were divided into calibration and validation sets by the KS algorithm, using the PCs as variables, the regression model was built on the training set.

In the model formation step, an RBFNN was built to relate the PCs to pIC50. This model is called PCA-RBF-ANN henceforth.

The inputs of the network were the eigenvalue-ranked PCs. The number of PCs entering the neural network was varied from 1 to 18, and 9 of these PCs were selected as the network input. Using this number of PCs gave the best results on the basis of the lowest root mean square error for the training set (RMSET) and root mean square error of cross-validation (RMSECV) in the output of the network (Fig. 3).

For the PCA-RBF-ANN model, the ‘‘spread’’ and the

number of the RBFs (the hidden layer units) are the two

Fig. 4 The optimization of

number of nodes in hidden layer

and spread value for radial basis

function neural network using

RMSECV

Fig. 3 Optimization of number of PCs used in ANN model


important parameters influencing the performance of the developed network. The optimal width value for the RBF-ANN was selected by systematically changing its value in the training step. The values that gave the best RMSE in LOO cross-validation were used in the model. Each minimum LOO cross-validation error was plotted against the width (Fig. 4), and the minimum was chosen as the optimal condition. Finally, the number of hidden layer units was 18, and the optimal spread was 1.7.

Based on the above optimization, a 9-18-1 RBF-ANN model was finally constructed. The predictions of the RBF-ANN model are shown in Table 4, and the plot of the predicted versus experimental values of both training and test sets is shown in Fig. 5.

Developed QSAR models can be validated by different theoretical tools. These tools can generally be divided into two categories: internal validation (which does not use any external source of data to verify the predictability of the generated models) and external validation (which employs, for validation, a new set of data not used in the model fitting exercise). One of the most important internal validation methods is leave-one-out cross-validation (LOO-CV).

The statistical parameters obtained by the generated model for the training and prediction sets are shown in Table 5. The PCA-RBF-ANN model has an RMSE of 0.186 for the training set and 0.042 for the test set. The squared correlation coefficient (R2) is 0.869 for the training set and 0.994 for the test set.
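As a quick consistency check on Table 5, and assuming the usual definition of RMSE as the square root of PRESS divided by the number of molecules N (the formula is not stated explicitly here), the tabulated training-set values agree with each other:

```latex
\mathrm{RMSE} = \sqrt{\frac{\mathrm{PRESS}}{N}}
\qquad\Rightarrow\qquad
\mathrm{RMSE}_{\text{train}} = \sqrt{\frac{0.970}{28}} \approx 0.186
```

The same relation applied to the test-set entries gives approximately 0.041, which matches the tabulated 0.042 within the rounding of PRESS to three decimals.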

Fig. 5 The plot of the predicted and experimental values of both training and test sets

Table 4 Experimental and predicted activities for studied molecules

Molecule No.   Y      Ypred   REP
1              6.80   6.91    -1.62E-02
2              6.89   6.89    -9.63E-06
3              7.05   7.04    5.40E-07
4              6.82   6.82    -5.24E-05
5              7.10   7.10    1.13E-04
6              7.00   7.00    2.35E-04
7              7.00   7.00    1.63E-04
8              7.05   7.05    -3.15E-05
9              6.55   6.55    -3.36E-04
10             6.52   6.52    1.58E-05
11             7.15   7.15    -6.64E-06
12             7.10   7.10    4.31E-06
13             7.00   7.00    6.59E-06
14             6.68   6.68    -2.42E-06
15             6.52   6.52    2.26E-06
16             6.96   6.96    -6.90E-06
17             6.72   7.10    -5.35E-02
18             6.47   6.47    -3.20E-07
19             6.92   6.65    4.11E-02
20             7.40   7.40    -4.51E-06
21             5.66   5.66    -1.53E-06
22             7.00   7.00    4.25E-06
23             7.70   7.70    3.28E-05
24             7.30   7.30    2.42E-05
25             7.64   7.12    7.34E-02
26             6.85   6.85    1.14E-05
27             5.48   6.17    -1.12E-01
28             6.23   6.23    6.53E-05
29             6.82   6.83    -3.14E-04
30             6.66   6.66    1.10E-05
31             6.60   6.60    3.12E-04
32             5.62   5.62    5.74E-06
33             6.28   6.28    2.18E-05
34             6.07   6.06    3.20E-06
35             6.03   6.03    4.56E-05

Y experimental activity, Ypred activity calculated using the model, REP relative error of prediction

The usefulness of a QSAR model lies in its capability not only to reproduce known data but also to give a good estimate of bioactivity for any external object (Gramatica and Papa, 2003). The predictive ability of the generated PCA-RBF-ANN model can be severely affected by over-fitting. In QSAR modeling, over-fitting arises when uninformative independent variables enter the model. Several techniques are available to assess the quality of the developed models (Golbraikh and Tropsha, 2002; Gramatica et al., 2007), of which cross-validation is the most commonly employed (Zhang and Tropsha, 2000). Therefore, to examine the predictability and over-fitting of the generated model, the LOO-CV procedure was used. The squared correlation coefficient for cross-validation (Q2) was then estimated by the equation $Q^2 = 1 - \mathrm{PRESS}/\mathrm{SSD}$, where PRESS and SSD are the predicted residual sum of squares and the sum of the squared deviations from the mean, respectively. The result of the LOO-CV procedure for the training subset of the investigated compounds is reported in Table 5. The cross-validation results show that the generated model has a Q2 value of 0.732; therefore, the developed model is a reasonable QSAR model.
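The Q2 statistic is one minus the ratio of PRESS to SSD, and can be computed as in the short sketch below. The function name `q2_loo` is hypothetical; the five experimental values are taken from Table 4 purely for illustration, and the cross-validated predictions are invented numbers, not results from the study.

```python
import numpy as np

def q2_loo(y, y_cv):
    """Q2 = 1 - PRESS / SSD for cross-validated predictions y_cv."""
    y = np.asarray(y, dtype=float)
    y_cv = np.asarray(y_cv, dtype=float)
    press = np.sum((y - y_cv) ** 2)        # predicted residual sum of squares
    ssd = np.sum((y - y.mean()) ** 2)      # squared deviations from the mean
    return 1.0 - press / ssd

y = [6.80, 7.05, 6.55, 7.40, 5.66]         # experimental pIC50 (subset of Table 4)
y_cv = [6.90, 6.95, 6.70, 7.20, 5.90]      # hypothetical LOO predictions
q2 = q2_loo(y, y_cv)
```

A Q2 of 1 corresponds to perfect cross-validated prediction, while a model that only predicts the mean activity gives a Q2 of 0.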

Conclusion

In this study, DFT-derived quantum-chemical descriptors such as local charges and electrostatic potentials of atoms, dipole moments, and HOMO and LUMO energies, in combination with principal component analysis and RBFNN, were investigated for building a quantitative structure-activity relationship for the prediction of the glucagon receptor inhibitory activity of a series of 35 imidazole derivatives.

A proper model with high statistical quality and low prediction errors was obtained. The predictive power of the developed model was further demonstrated using an external validation method, in which the 35 studied compounds were divided into training and test sets. This validation method also confirmed the predictive power of the generated RBFNN.

The results of the nonlinear model reflect a nonlinear relationship between the principal components obtained from quantum-electronic molecular descriptors and the glucagon receptor inhibitory activity for the studied set of molecules.

With the rapid advancement of computer hardware, software, and DFT methods, electronic structure theory has become an efficient and routine tool for explaining the structural features of an extensive range of molecules. Therefore, in the present study, we endeavored to extend the application field of traditional QSAR analysis by combining electronic structure theory calculations with RBFNN.

Acknowledgments The authors are grateful to the Vice Chancellor

for Research and Technology, Kermanshah University of Medical

Sciences for the financial support. This article resulted from the

Pharm. D thesis of Zohreh Nazari, Major of Pharmacy, Kermanshah

University of Medical Sciences, Kermanshah, Iran.

References

Arkan E, Shahlaei M, Pourhossein A, Fakhri K, Fassihi A (2010) Validated QSAR analysis of some diaryl substituted pyrazoles as CCR2 inhibitors by various linear and nonlinear multivariate chemometrics methods. Eur J Med Chem 45:3394–3406

Becke AD (1993) Density-functional thermochemistry. III. The role of exact exchange. J Chem Phys 98:5648

Brand C, Rolin B, Jørgensen P, Svendsen I, Kristensen J, Holst J (1994) Immunoneutralization of endogenous glucagon with monoclonal glucagon antibody normalizes hyperglycaemia in moderately streptozotocin-diabetic rats. Diabetologia 37:985–993

Breneman CM, Rhem M (1997) QSPR analysis of HPLC column capacity factors for a set of high-energy materials using electronic van der Waals surface property descriptors computed by transferable atom equivalent method. J Comput Chem 18:182–197

Burcelin R, Katz E, Charron M (1996) Molecular and cellular aspects of the glucagon receptor: role in diabetes and metabolism. Diabetes Metab 22:373–396

Buyukbingol E, Sisman A, Akyildiz M, Alparslan FN, Adejare A (2007) Adaptive neuro-fuzzy inference system (ANFIS): a new approach to predictive modeling in QSAR applications: a study of neuro-fuzzy modeling of PCP-based NMDA receptor antagonists. Bioorg Med Chem 15:4265–4282

Cartier A, Rivail J-L (1987) Electronic descriptors in quantitative structure-activity relationships. Chemom Intell Lab Syst 1(4):335–347

Chang LL, Sidler KL, Cascieri MA, de Laszlo S, Koch G, Li B, MacCoss M, Mantlo N, O’Keefe S, Pang M (2001) Substituted imidazoles as glucagon receptor antagonists. Bioorg Med Chem Lett 11:2549–2553

Clare BW (1995) Structure-activity correlations for psychotomimetics. III. Tryptamines. Aust J Chem 48:1385–1400

Clare BW, Supuran CT (1994) Carbonic anhydrase activators. 3: structure-activity correlations for a series of isozyme II activators. J Pharm Sci 83:768–773

Clare BW, Supuran CT (1998) Semi-empirical atomic charges and dipole moments in hypervalent sulfonamide molecules: descriptors in QSAR studies. J Mol Struct Theochem 428:109–121

Fassihi A, Shahlaei M, Moeinifard B, Sabet R (2012) QSAR study of anthranilic acid sulfonamides as methionine aminopeptidase-2 inhibitors. Monatshefte für Chemie-Chemical Monthly 143:189–198

Table 5 Statistical parameters obtained for the ANN model

Parameter   PCA-RBF-ANN
Dataset     Training set   Test set
N           28             7
R2          0.869          0.994
RMSE        0.186          0.042
PRESS       0.970          0.012
Q2          0.732          –
RMSECV      0.346          –
PRESSCV     2.157          –

N number of molecules in dataset, R2 squared correlation coefficient of experimental and predicted activities, RMSE root mean square error, PRESS predicted residual sum of squares, Q2 squared correlation coefficient of leave-one-out cross-validation, RMSECV root mean square error of cross-validation, PRESSCV predicted residual sum of squares of cross-validation

Frisch M, Trucks G, Schlegel H, Scuseria G, Robb M, Cheeseman J, Montgomery J, Vreven T, Kudin K, Burant J (2008) Gaussian 03, revision C.02

Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graph Model 20(4):269–276

Gramatica P, Papa E (2003) QSAR modeling of bioconcentration factor by theoretical molecular descriptors. QSAR Comb Sci 22:374–385

Gramatica P, Giani E, Papa E (2007) Statistical external validation and consensus modeling: a QSPR case study for Koc prediction. J Mol Graph Model 25:755–766

Hill DL (1972) The biochemistry and physiology of tetrahymena, vol 230. Academic Press, New York

Hohenberg P, Kohn W (1964) Inhomogeneous electron gas. Phys Rev 136:B864

Johnson MEM, Das NM, Butcher FR, Fain JN (1972) The regulation of gluconeogenesis in isolated rat liver cells by glucagon, insulin, dibutyryl cyclic adenosine monophosphate, and fatty acids. J Biol Chem 247:3229–3235

Kennard R, Stone L (1969) Computer aided design of experiments. Technometrics 11:137–148

Khadikar PV, Phadnis A, Shrivastava A (2002) QSAR study on toxicity to aqueous organisms using the PI index. Bioorg Med Chem 10:1181–1188

Klopman G, Iroff LD (2004) Calculation of partition coefficients by the charge density method. J Comput Chem 2:157–160

Kohn W, Sham LJ (1965) Self-consistent equations including exchange and correlation effects. Phys Rev 140:A1133–A1138

Lee C, Yang W, Parr RG (1988) Development of the Colle–Salvetti correlation-energy formula into a functional of the electron density. Phys Rev B 37:785–789

Lewis D, Lake B, Ioannides C, Parke D (1994) Inhibition of rat hepatic aryl hydrocarbon hydroxylase activity by a series of 7-hydroxy coumarins: QSAR studies. Xenobiotica 24:829–838

Mulliken R (1955a) Electronic population analysis on LCAO-MO molecular wave functions. III. Effects of hybridization on overlap and gross AO populations. J Chem Phys 23:2338–2342

Mulliken R (1955b) Electronic population analysis on LCAO-MO molecular wave functions. IV. Bonding and antibonding in LCAO and valence-bond theories. J Chem Phys 23:2343

Mulliken R (1955c) Electronic population analysis on LCAO-MO molecular wave functions. II. Overlap populations, bond orders, and covalent bond energies. J Chem Phys 23:1841

Mulliken RS (1955d) Electronic population analysis on LCAO-MO molecular wave functions. I. J Chem Phys 23:1833–1840

Ordorica M, Velazquez M, Ordorica J, Escobar J, Lehmann P (1993) A principal component and cluster significance analysis of the antiparasitic potency of praziquantel and some analogues. Quant Struct-Act Relat 12:246–250

Rajer-Kanduc K, Zupan J, Majcen N (2003) Separation of data on the training and test set for modelling: a case study for modelling of five colour properties of a white pigment. Chemom Intell Lab Syst 65:221–229

Saghaie MS, A Fassihi L (2012) Quantitative structure activities relationships of some 2-mercaptoimidazoles as CCR2 inhibitors using genetic algorithm-artificial neural networks. Res Pharm Sci 8:97–112

Saghaie L, Shahlaei M, Madadkar-Sobhani A, Fassihi A (2010) Application of partial least squares and radial basis function neural networks in multivariate imaging analysis-quantitative structure activity relationship: study of cyclin dependent kinase 4 inhibitors. J Mol Graph Model 29:518–528

Seierstad M, Agrafiotis DK (2006) A QSAR model of hERG binding using a large, diverse, and internally consistent training set. Chem Biol Drug Des 67:284–296

Shahlaei M (2013) Descriptor selection methods in quantitative structure–activity relationship studies: a review study. Chem Rev 113(10):8093–8103

Shahlaei M, Fassihi A (2012) QSAR analysis of some 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas as CCR5 inhibitors using genetic algorithm-least square support vector machine. Med Chem Res 22:4384–4400

Shahlaei M, Fassihi A, Saghaie L (2010a) Application of PC-ANN and PC-LS-SVM in QSAR of CCR1 antagonist compounds: a comparative study. Eur J Med Chem 45:1572–1582

Shahlaei M, Sabet R, Ziari MB, Moeinifard B, Fassihi A, Karbakhsh R (2010b) QSAR study of anthranilic acid sulfonamides as inhibitors of methionine aminopeptidase-2 using LS-SVM and GRNN based on principal components. Eur J Med Chem 45:4499–4508

Shahlaei M, Madadkar-Sobhani A, Fassihi A, Saghaie L, Arkan E (2012a) QSAR study of some CCR5 antagonists as anti-HIV agents using radial basis function neural network and general regression neural network on the basis of principal components. Med Chem Res 21:3246–3262

Shahlaei M, Madadkar-Sobhani A, Fassihi A, Saghaie L, Shamshirian D, Sakhi H (2012b) Comparative quantitative structure–activity relationship study of some 1-aminocyclopentyl-3-carboxyamides as CCR2 inhibitors using stepwise MLR, FA-MLR, and GA-PLS. Med Chem Res 21(1):100–115

Shahlaei M, Fassihi A, Saghaie L, Arkan E, Madadkar-Sobhani A, Pourhossein A (2013) Computational evaluation of some indenopyrazole derivatives as anticancer compounds; application of QSAR and docking methodologies. J Enz Inhib Med Chem 28:16–32

Sotomatsu T, Murata Y, Fujita T (1989) Correlation analysis of substituent effects on the acidity of benzoic acids by the AM1 method. J Comput Chem 10:94–98

Tetko I, Luik A, Poda G (1993) Applications of neural networks in structure-activity relationships of a small number of molecules. J Med Chem 36:811–814

Trivedi D, Lin Y, Ahn J-M, Siegel M, Mollova NN, Schram KH, Hruby VJ (2000) Design and synthesis of conformationally constrained glucagon analogues. J Med Chem 43:1714–1722

Tropsha A, Gramatica P, Gombar V (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22:69–77

Tuppurainen K, Lotjonen S, Laatikainen R, Vartiainen T, Maran U, Strandberg M, Tamm T (1991) About the mutagenicity of chlorine-substituted furanones and halopropenals. A QSAR study using molecular orbital indices. Mutat Res 247:97–102

Unger R, Orci L (1975) The essential role of glucagon in the pathogenesis of diabetes mellitus. Lancet 305:14–16

Valkova I, Vracko M, Basak SC (2004) Modeling of structure–mutagenicity relationships: counter propagation neural network approach using calculated structural descriptors. Anal Chim Acta 509:179–186

Verma RP, Kurup A, Hansch C (2005) On the role of polarizability in QSAR. Bioorg Med Chem 13:237–255

Wu W, Walczak B, Massart D, Heuerding S, Erni F, Last I, Prebble K (1996) Artificial neural networks in classification of NIR spectral data: design of the training set. Chemom Intell Lab Syst 33:35–46

Xiang Y, Liu M, Zhang X, Zhang R, Hu Z, Fan B, Doucet J, Panaye A (2002) Quantitative prediction of liquid chromatography retention of N-benzylideneanilines based on quantum chemical parameters and radial basis function neural network. J Chem Inf Comput Sci 42:592–597

Zhang W, Tropsha A (2000) Novel variable selection quantitative structure–property relationship approach based on the k-nearest-neighbor principle. J Chem Inf Comput Sci 40:185–194
