hierarchy neural networks as applied to pharmaceutical problems

29
Advanced Drug Delivery Reviews 55 (2003) 1119–1147 www.elsevier.com / locate / addr Hierarchy neural networks as applied to pharmaceutical problems * Hiroshi Ichikawa Hoshi University School of Pharmacy, Department of Information Science, Ebara 2-4-41 Shinagawa, Tokyo, 142-8501, Japan Received 3 February 2003; accepted 12 May 2003 Abstract Optimization and prediction are the main purposes in pharmaceutical application of the artificial neural networks (ANNs). To this end, hierarchy-type networks with the backpropagation learning method are most frequently used. This article reviews the basic operating characteristics of such networks. ANNs have outstanding abilities in both classification and fitting. The operation is basically carried out in a nonlinear manner. The nonlinearity brings forth merits as well as a small number of demerits. The reasons for the demerits are analyzed and their remedies are indicated. The mathematical relationships of ANN’s operation and the ALS method as well as the multiregression analysis are reviewed. ANN can be regarded as a function that transforms an input vector to another (output) one. We examined the analytical formula for the partial derivative of this function with respect to the elements of the input vector. This is a powerful means to know the relationship between the input and the output. The reconstruction-learning method determines the minimum number of necessary neurons of the network and is useful to find the necessary descriptors or to trace the flow of information from the input to the output. Finally, the descriptor-mapping method was reviewed to find the nonlinear relationships between the output intensity and descriptors. 2003 Elsevier B.V. All rights reserved. Keywords: Artificial neural network; Basic theory of operation; Hierarchy; Backpropagation, Reconstruction; Forgetting; Descriptor mapping; Partial derivative; Correlation between input and output Contents 1. Introduction ............................................................................................................................................................................ 1120 2. Simulation of the nerve system................................................................................................................................................. 1122 2.1. Biological nerve system in essence .................................................................................................................................... 1122 2.2. Biological neuron and artificial neuron .............................................................................................................................. 1122 2.3. Operation of artificial neuron ............................................................................................................................................ 1123 3. Basic theory of hierarchy-type neural network ........................................................................................................................... 1124 3.1. Network for classification ................................................................................................................................................. 1124 3.2. Network for fitting ........................................................................................................................................................... 1125 3.3. 
Training .......................................................................................................................................................................... 1125 *Tel. / fax: 181-3-5498-5761. E-mail address: [email protected] (H. Ichikawa). 0169-409X / 03 / $ – see front matter 2003 Elsevier B.V. All rights reserved. doi:10.1016 / S0169-409X(03)00115-7

Upload: hiroshi-ichikawa

Post on 16-Sep-2016

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Hierarchy neural networks as applied to pharmaceutical problems

Advanced Drug Delivery Reviews 55 (2003) 1119–1147www.elsevier.com/ locate/addr

H ierarchy neural networks as applied to pharmaceuticalproblems

*Hiroshi Ichikawa

Hoshi University School of Pharmacy, Department of Information Science, Ebara 2-4-41 Shinagawa, Tokyo, 142-8501, Japan

Received 3 February 2003; accepted 12 May 2003

Abstract

Optimization and prediction are the main purposes in pharmaceutical application of the artificial neural networks (ANNs).To this end, hierarchy-type networks with the backpropagation learning method are most frequently used. This articlereviews the basic operating characteristics of such networks. ANNs have outstanding abilities in both classification andfitting. The operation is basically carried out in a nonlinear manner. The nonlinearity brings forth merits as well as a smallnumber of demerits. The reasons for the demerits are analyzed and their remedies are indicated. The mathematicalrelationships of ANN’s operation and the ALS method as well as the multiregression analysis are reviewed. ANN can beregarded as a function that transforms an input vector to another (output) one. We examined the analytical formula for thepartial derivative of this function with respect to the elements of the input vector. This is a powerful means to know therelationship between the input and the output. The reconstruction-learning method determines the minimum number ofnecessary neurons of the network and is useful to find the necessary descriptors or to trace the flow of information from theinput to the output. Finally, the descriptor-mapping method was reviewed to find the nonlinear relationships between theoutput intensity and descriptors. 2003 Elsevier B.V. All rights reserved.

Keywords: Artificial neural network; Basic theory of operation; Hierarchy; Backpropagation, Reconstruction; Forgetting; Descriptormapping; Partial derivative; Correlation between input and output

Contents

1 . Introduction ............................................................................................................................................................................ 11202 . Simulation of the nerve system................................................................................................................................................. 1122

2 .1. Biological nerve system in essence .................................................................................................................................... 11222 .2. Biological neuron and artificial neuron .............................................................................................................................. 11222 .3. Operation of artificial neuron ............................................................................................................................................ 1123

3 . Basic theory of hierarchy-type neural network ........................................................................................................................... 11243 .1. Network for classification ................................................................................................................................................. 11243 .2. Network for fitting ........................................................................................................................................................... 11253 .3. Training .......................................................................................................................................................................... 1125

*Tel. / fax: 181-3-5498-5761.E-mail address: [email protected](H. Ichikawa).

0169-409X/03/$ – see front matter 2003 Elsevier B.V. All rights reserved.doi:10.1016/S0169-409X(03)00115-7

Page 2: Hierarchy neural networks as applied to pharmaceutical problems

1120 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

4 . Characteristics and/or problems of the neural network’s operation.............................................................................................. 11264 .1. Basic operating characteristics .......................................................................................................................................... 11264 .2. Some difficulties .............................................................................................................................................................. 1127

5 . Relationship of operation between ANNs and conventional methods........................................................................................... 11285 .1. The ALS method ............................................................................................................................................................. 11285 .2. The multiregression analysis ............................................................................................................................................. 1129

6 . How to overcome defects of ANN’s operation and purposive extension ...................................................................................... 11296 .1. How to deal with excessive nonlinear operation ................................................................................................................. 11296 .2. How to obtain partial differential coefficients by the neural network .................................................................................... 1130

6 .2.1. Accuracy of the partial derivatives obtained by the neural network ............................................................................ 11306 .2.2. Is the independency between input neurons kept? ..................................................................................................... 11316 .2.3. Isolation of functions out of the mixed functions ...................................................................................................... 11326 .2.4. Recognition of two similar functions ....................................................................................................................... 11346 .2.5. Recognition two similar functions by correlated input data ........................................................................................ 11356 .2.6. Application the partial derivative method ................................................................................................................. 1135

6 .3. Reconstruction learning .................................................................................................................................................... 11376 .3.1. Introduction of the forgetting procedure into the learning phase ................................................................................. 11376 .3.2. How does reconstruction learning work?.................................................................................................................. 1138

136 .3.3. Practical application: the relationship between C NMR chemical shift and the conformation of norbornane andnorbornene ............................................................................................................................................................ 1139

6 .4. Descriptor mapping .......................................................................................................................................................... 11396 .4.1. Method.................................................................................................................................................................. 11406 .4.2. Examination using mathematical functionsw 5 x 1 2y 13z ............................................................................11406 .4.3. Application to SAR analysis ................................................................................................................................... 1144

7 . Concluding remarks ................................................................................................................................................................ 1145Acknowledgements...................................................................................................................................................................... 1145References .................................................................................................................................................................................. 1146

1 . Introduction matical function and the thickness of fibers isexpressed as a weight value between neurons. If one

It could have an incredible influence on many teaches, the simulated nerve systems learn andfields when a fundamental phenomenon comes to behave as if they were a kind of brain of livinglight. The question of information processing in things and are called artificial neural networksliving things has been the object of such research. (ANNs).The functioning mechanism of the nerve system of ANNs may have two types of connection, i.e. theliving things had gradually emerged by the early hierarchy connection and mutual connection al-1940s. If one removes additional or decorative though the former is regarded as a special case of thefactors from the nerve system, the very central latter (Fig. 1). In the hierarchy network, the signalsfactors are reduced to neurons and nerve fibers that (information) proceed from the input to the outputconnect neurons. The physiology of such a nerve without feeding back to the passed-through neurons.system reveals that the neuron transmits discrete The mutual-connection type neural network, whichinformation as an action potential when a certain allows such feedback, has developed to the Hopfield-amount of information is accumulated on the cell type networks[3,4] and, incorporating the idea ofmembrane. It is also understood that the functioning statistical mechanics, turned out to be the Boltzman-of memory and cognition is carried out as ‘thickness’ n-machine type networks[5].of fibers between neurons. The nerve system is an Turning to the method of learning, supervisedassembly of such neurons with nerve fibers[1]. learning and unsupervised learning systems may be

Since the nerve system is easily simulated in a used (Fig. 2). To achieve learning, a standard ofdigital computer, the behavior of variously arranged evaluation is necessary. The results of evaluationand connected neurons has been studied[2], where must be fed back to change the thickness of fibers,the functioning of a neuron is replaced by a mathe- i.e. the weight values, to adapt the output to the

Page 3: Hierarchy neural networks as applied to pharmaceutical problems

H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147 1121

Fig. 1. Hierarchy connection-type neurons (a) and mutual connection-type neurons (b). Here, arrows indicate the flow of information.

standard. The supervised learning is given such a application in the fields of pharmaceutical andstandard from the outside of the network system. As related sciences. First, application of ANNs in thesea supervised learning method, the backpropagation fields may be traced back to 1989, where a generalmethod [2] has been established on the hierarchy treatment in the pharmaceutical field was discussedneural networks, which have been most frequently [11]. The first application in QSAR is seen in 1990applied to pharmaceutical problems. [12]. Since Devillers has reviewed the early applica-

The unsupervised learning is based only on the tion of ANNs to drug-related fields[13], here, weinternal structure of the network. In the Hopfield- quote review articles from 1994 to 2001.type and Boltzmann-machine network structures, the Since the first successful application to the QSARunsupervised learning method has been adopted. We study[11], a large number of articles has appeared.do not intend to go further into a detailed explanation They are continuously reviewed[14–18]. ANN isof these structures but, here, recommended textbooks also a study object of clinical and biomedicalare cited[2,6]. One thing to add is: since a molecule application, and review articles are given in Refs.has a 3D geometry, 3D information is conveniently [19–24]. Traditional Chinese medicine requires fullreduced to a lower dimension to analyze molecular experience and a delicate balance of herbal com-properties. For this purpose, Hopfield-type, ponents. Application to this end is considerable andBoltzmann machine, or Kohonen-type networks[7,8] has been successfully carried out as seen in Refs.have been studied for application and are discussed/[25–28],although they are not review articles. ANNreviewed by Zupan and Gasteiger and co-workers seems to be a new modeling method that has not[6,9] and Doucet and Panaye[10]. Here, the been broadly applied to pharmaceutical sciences; aKohonen-type network may be regarded as a combi- number of articles have been published and severalnation of one-layer networks with hierarchy neurons. groups have already reviewed them[22,28–31].

Now let us give a brief bibliography of ANN As mentioned, the hierarchy-type neural network

Fig. 2. Supervised learning (a) and unsupervised learning (b).

Page 4: Hierarchy neural networks as applied to pharmaceutical problems

1122 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

with the backpropagation learning procedure has chemical transmitters: the former transmits the po-been mostly used in the medical and pharmaceutical tential to other somas in a regular way but the latter,areas. This is because classification and fitting are in a reverse way. Thus, a neuron may receive thethe most commonly used technique for rationaliza- signal from other neurons as the sum of thosetion and prediction in these fields. But in most cases, positive and negative signals. Without a signal fromANN with the backpropagation algorism has been other neurons, the neuron has a standing electricapplied without rationalization of its operation. Thus, potential (ca.250 mV). But it gives a pulse of athis article treats the hierarchy-type neural network high potential (ca.170 mV) when it receives morein an attempt to explain its proper application, where than a certain amount of information and sends itsspecial emphasis is placed on the rationalization of potential to other neurons.its operation. Concerning learning of a living organism, Hebb

proposed a mechanism that the connection betweenneurons is strengthened according to the frequencyof signal-passing and the signal easily gets through

2 . Simulation of the nerve system such a connection. This makes the neural network ofa living organism plastic to form memory and

2 .1. Biological nerve system in essence cognition [33]. This is called the Hebbian rule. AnANN is a network of artificial neurons, where the

The idea of modeling of the biological neuron can Hebbian rule is applied to the connections of neu-be traced back to when McCulloch and Pitts found rons.the mechanism of the nerve system of a livingorganism in 1943[32]. A biological neuron, which isessential for information processing, has two types of 2 .2. Biological neuron and artificial neuronextensions: the dendrites and the axon. The dendritesreceive signals from other neurons through synaptic Let us consider how to simulate the neuron’sconnections to the cell body (soma). Here, the signal function. Neuronj in Fig. 3 has the 0 and 1 states ofis actually an electric potential that may change the output corresponding to standing and excited states.potential of another soma. There are positive and In the standing state, the neuron produces a lownegative synaptic connections according to the potential (0) and the excited one gives a high

Fig. 3. How to simulate the biological nerve network system;x andW are the output strength from a neuron and the thickness of the fiberthat is connected to neuronj.

Page 5: Hierarchy neural networks as applied to pharmaceutical problems

H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147 1123

Fig. 5. Step function of a biological neuron is simulated by thesigmoid function.

Fig. 4. Function of neuronj. When the sum of signals (in-formation) from presynaptic neurons (x) through fibers (W )exceeds more than a threshold value (u ), neuronj ignites.

transfer function. In ANN, the activation function ofa neuron should be differentiable; the most popular

potential (1). The operation of such a neuron may be one is the sigmoid function.expressed as shown inFig. 4, where u is thethreshold value for ignition. Axisz represents the 1

]]]]f( y)5 (4)potential of the output and that ofy is for the amount 11 exp (2ay)of the input signal,SW x , which is the sum of theij i

potentials from presynaptic neurons.W is called anij According to the value ofa, this function changeselement of weight matrix (W ) between neuronj and from one of nearly linear function to such a strong-presynaptic neuroni. nonlinear function as the step function (Fig. 5). The

This function may be expressed as sigmoid function is appropriate in the backpropaga-tion learning, since it is differentiable.n

y 5OW x 2u 5WX 2uij i (1)i51

2 .3. Operation of artificial neuronz 5 f(y)

hereW and X are vectors (W , W ,.., W ) and (x , Let us examine the operation of each neuron. As1j 2j nj 1

x , . . . , x ) and n is the number of presynaptic Eq. (2) shows,W determines the output. Ify is set at2 n

neurons. Adding elementsW 5u andx 51 as 0, XW is interpreted as a supersurface in theX spacen11,j n11

(W , W ,..,W , u ) and (x , x , . . . , x , 1), Eq. (1) is that determines the position and the gradient.Fig. 61j 2j nj 1 2 n

rewritten simply as shows the concept of the neuron’s operation: thevectorX is expressed as dots and the supersurface is

y 5WX (2)the straight line on whichy is 0. It is understood thatthe z value of the upper side of the line is 1 and thatz in Eq. (1) is the output value of neuronj and is aof the lower side is 0. Namely, the neuron performsstep function which takes the following valuesrecognition/classification based onW that is givenby learning. The line of the step function as Eq. (3)1 if y .0

f( y)5 (3)H has no width, but that of the sigmoid function may0 if y ,0be wide and thez value is somewhere between 0

This function is called the activation function or (lower side) and 1 (upper side).

Page 6: Hierarchy neural networks as applied to pharmaceutical problems

1124 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

Fig. 6. Operation of an artificial neuron. According toW, the neuron determines 0 (lower region) or 1 (upper region) in theX space. If thesigmoid function is used, there appears a wide decision line on whose value is somewhere between 0 and 1.

3 . Basic theory of hierarchy-type neural input to A and output from B. Usually, the numbernetwork of neurons in the input layer is set to be equal to that

of the parameters (descriptors) plus 1 (bias), while3 .1. Network for classification that of the output layer is equal to the number of

categories. However, it should be noted that the biasShown inFig. 7 is the hierarchy-type three-layer for the second layer is unnecessary. The number of

network. In fact, it is easily shown that the multi- neurons in the second layer is arbitrary. However, alayer, which is more than three, is unnecessary. The small number may reduce the resolution abilitycircles are neurons that are, in simulation, variables whereas a large number consumes considerable timetaking a value ranging from 0 to 1. The data are in the training phase. The recommended number is

somewhere between the number of the input neuronsand double their number, unless the number of theinput parameters is small (e.g. less than|5). Here,

we recommend that the structure of a hierarchy-typeneural network is expressed asN(a,b,c), in which a,b and c are the numbers of neurons of the first,second and third layers[34].

The activation function of the first-layer neurons isusually set to be the liner function that outputs theinput value without any change [Eq. (5)]

O 5 y (5)j j

As the activation function of the second- and third-layer neurons, the sigmoid function is best usedalthough variation is possible. Namely, the value of aneuron (O ) at the second or third layer can beFig. 7. Three-layer network for classification. The data is input to j

expressed by Eq. (6)A and output from B.

Page 7: Hierarchy neural networks as applied to pharmaceutical problems

H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147 1125

the linear function [Eq. (1)]. This type of neural1]]]]]O 5 f(y )5 network is termed an MR-type network[35].j i 11 exp (2ay )f gi (6)

y 5 OW x 2ui ij i jS D 3 .3. Trainingj

Given N neurons at the first layer, a vector canwherex is one of the values of a neuron at the firstiexpress a set of the input data withN elements foror second layer;W , an element of the weight matrix,ijthe input neurons, which is called here an ‘inputexpresses the weight value of the connection be-pattern’. Likewise, the output data can also between neuronsi andj, and takes either a positive or aregarded as a vector and be called an ‘output pattern’negative value;a is a parameter which expresses the(O ). The vector, which is compared with an outputnonlinearity of the neuron’s operation;u is the jjpattern to fixW , is called a ‘training pattern’ (t ).threshold value for neuronj. Usually, one needs not ij j

The training of the network is based on the followingset this value if constant 1 is always given as a biasequations:to one of the neurons in the input layer. The reason

is that the connection weights between the neuron (n21,n) (n)dW 5 2 d x ´ (7)ij j iwith constant 1 and any of the neurons in the second

layer are optimized during the learning (training)(3)d 5 (O 2 t )g0(y ) (8a)phase to play the same role as that of the optimized j j j j

u. The sigmoid function is not unique as the activa-(2) (2,3) (3)tion function, f(y), and may be replaced by any d 5 OW d f0(y ) (8b)j jl l jS Dfunction with an appropriate operation if it is dif- l

ferentiable. But such variation may be meaningless.Here, f0() and g9() are the derivative functions of

the activation functions for the second- and third-3 .2. Network for fitting 1ayer neurons, f() and g() respectively, and´ is a

parameter which determines the shift for correctionThere is not much difference in the network in backpropagation. The superscripts onW and d

structure between classification and fitting (Fig. 8). indicate the relevant layer(s), and thus Eq. (8a) isThe only difference is the activation function for the used only to correct the connection between thethird-layer neuron: the activation function should be second and third (output) layers while Eq. (8b) is for

another connection. If f and g are the sigmoid functions, the derivative functions, f9 and g9 in Eq.

(8) are

f0(y )5 g0(y )5 f( y ) 12 f( y ) a (9)f gj j j j

In the above equations, both́ and a can be setindependently of the layer.

Since the value of each neuron is defined between0 and 1, the given data must be scaled within thedefined region. Here it should be noted that if thevalue of a neuron in the input layer is zero, theconnections from such a neuron are always null, i.e.the information from that neuron is not propagated tothe second and the third layers. To avoid thisdifficulty, the smallest value should be set at slightlylarger than zero, typically 0.1. Therefore, the follow-Fig. 8. Three-layer network for fitting. The activation function for

the third layer should be a linear function. ing scaling equation may be used:

Page 8: Hierarchy neural networks as applied to pharmaceutical problems

1126 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

operation. Aoyama and Ichikawa studied basic oper-(q 2 q )p 1 q p 2 q pmax min min max max min¯ ]]]]]]]]]]]p 5 (10) ating characteristics of neural networks when appliedp 2 pmax min to structure–activity studies[37]. According to theirreport, one may see an asymmetric character in thewhereq andq are the maximum and minimummax min classification. A typical example is as follows. Theyvalues of scaling,p is the element of the inputplaced two regions, class 1 and class 2 (Fig. 9). Afterpattern to be scaled, andp and p are themax min training those classes, they examined how pointsa tomaximum and minimum values of the elements ofd and pointso, p andq are classified. Pointsa andbthe actual input patterns. Finally, each element of theare assigned to class 1, andc andd to class 2. It was,training pattern should also be between 0 and 1.however, shown that the output patterns for pointso,There may be a number of ways to express thep and q are unanimously shifted to region 2, thegraded categories. A simple way is to use the patternregion with larger positional data.(0,0,0,0,1) for, for example, the first and (1,0,0,0,0)

As already discussed, one can allow the neuralfor the fifth grade in the five-graded classification.network do a job similar to that of multiregressionBy doing so, one can observe the degree of contami-analysis. In fact, the analysis by the network alwaysnation in the determined class from other classes.gives better results. This stems from the basicTraining is carried out according to the abovedifference between the two analytic methods, that is,backpropagation algorithm until the error functionmultiregression analysis seeks linear dependencewhile the network uses nonlinear dependence includ-2E 5O(O 2 t ) (11)j j ing linear dependence. This nonlinear fitting impliesj

an asymmetric fitting. How the fitting is carried outbecomes small enough. can be seen in that of the straight line, for example,

Using random numbers as all elements take the y 5 x. Fig. 10 shows how fitting is carried outvalues between21 and 1 creates the initial weight according to the given threshold value of the errormatrices. Even whenM sets of the input and training function (convergence), where the straight dotted linepatterns are given, all of the output patterns can be is y 5 x on which the input and training data aremade close enough to the training patterns by located. The network structure wasN(2,5,1). Oneiteration through Eqs. (7) and (8) owing to the may observe an interesting characteristic of theconvergence theorem of perceptron[36]. If conver- fitting performance: at both ends (i.e. 0.0 and 19.5)gence is attained, then the neural network for classi- the difference between input and calculated valuesfication has an ability to classify the input patternsinto M groups while that for fitting performs the

function as a nonlinear multiregression analysis.

4 . Characteristics and/or problems of theneural network’s operation

4 .1. Basic operating characteristics

The neural network accepts any kind of data thatis expressed numerically. In the learning phase,information about the relationship between the inputand training patterns is accumulated in the weightmatrices. Once these matrices are determined, thenetwork predicts the categories or intensity of un- Fig. 9. Symmetric learning. Positional data on thex–y plane fortrained data, even if they are out of the defined regions 1 and 2 are trained and positionsa–d and o–q are

examined to know how they are classified.regions. We need to know the characteristics of such

Page 9: Hierarchy neural networks as applied to pharmaceutical problems

H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147 1127

1: [hp j,ht j] 2:[ hp j,ht j] 3:[ hp j,ht j] 4:[ hp j,ht j]i i i i i i i i

Among those training sets, there may be pluralcombinations that are equivalent. Removing suchredundancies, the effective number of combinationsis determined to beN. After N number of trainings,the symmetric output pattern is derived by

[1,2 [3,4

21N O O 1O (12O ) (12)H JIi IiI I

where the first summation covers groups 1 and 2 andthe second groups 3 and 4. By this procedure onecan obtain completely symmetric results[37]. How-ever, such a procedure might be unnecessary inpractical use.

4 .2. Some difficulties

The operation of ANNs is basically nonlinear. Thecharacteristics of nonlinear classification and fittingmay be illustrated using a two-dimensional space.

Fig. 10. How fitting is carried out. The dotted line representsFig. 11 is for classification. When one wants toy 5 x.classify the open and filled circles, there are two

becomes substantial and the deviation appearsways: one is to use the linear line (actually linearsinusoidally along the straight line. It was also super-surface) and the other is to use the nonlinearshown that as thea value is reduced to a smaller line (nonlinear super-surface). One may easily under-value, the sinusoidal deviation becomes small. This stand the advantage of the nonlinear separation. Atis understandable because thea value in Eq. (4) can the same time, it is easily predicted concerningbe regarded as a nonlinearity parameter. nonlinear classification that if one of the open circles

It was observed that an asymmetric character close to the separation surface is lacking, a differentgenerally appears in both classification and fitting. separation surface is so formed that the removedHere, we discuss this problem, although the degree is

small and it is unnecessary to worry about it inpractical use. The asymmetric behavior stems fromthe asymmetric evaluation ofWij in Eq. (6). Namely,the information with a large value of the inputelement propagates intensively to the second layerthrough the weight matrix. It is, therefore, possible tomake the network perform a symmetric operation byadopting a symmetric output function or simplyadopting a ‘symmetrical training’ procedure. Let usexplain the latter method.

Suppose that the input data,hp j, and the trainingi

data, ht j, are scaled between 0 and 1. Then theiFig. 11. In (A) the dot straight line is the supersurface of linearreverse ofhp j and ht j is defined ashp j(512p )i i i i separation, while the solid curve is that of nonlinear separation. In

and ht j(512t ). Using them, the following fouri i (B) if the blank circle lacks, a new supersurface of nonlineargroups of backpropagation combinations are consid- separation is created and the lacked circle may be judged to beered. back.

Page 10: Hierarchy neural networks as applied to pharmaceutical problems

1128 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

5 . Relationship of operation between ANNs andconventional methods

5 .1. The ALS method [38]

Now consider the relationship of operation be-tween the ALS method and the hierarchy neuralnetwork [34]. The discriminative functionL in theALS method is expressed using anm number ofdescriptorsx and weight coefficientsw as

( j ) ( j ) ( j )L 5w 1w x 1w x 1 ? ? ? ; L0 1 1 2 2

( j )5XW (13)

The rule of discrimination is given according to theFig. 12. Nonlinear operation in fitting. If the braced data are taken values ofL. Thus, if a ,L,a then the group isn n11out, a new fitting curve (B) may be created. Then the braced data placed in classn.may not be correctly predicted. ( j )The weight coefficient at cyclej, W is obtained

by

( j ) t 21 t ( j )W 5 (X X) X S (14)circle may not be predicted (Fig. 11B). The samething can apply to the case of fitting (Fig. 12). Using the correction termC, S is given byAnother problem is that such a flexible nonlinear

( j11) ( j )S 5L orfitting line easily adapts itself even to fit errors (Fig. i (15)( j ) ( j )13). Nonlinear operation has merits and demerits. To 5L 2Ciavoid demerits, one needs to introduce some linear

Here, consider the role of termS in Eq. (15). Sincecharacter to the network.the dimension ofW is null, S must have the samemathematical and physical characteristics asX.Therefore, the expression by Eq. (15) is appropriate

and unique since there is no other quantity equivalentto X in the ALS system. Eq. (15) indicates theSreceives a feedback from the output and is, indeed, akind of backpropagation procedure. It is, therefore,easy to simulate the ALS operation in the neuralnetwork by imposing the following restrictions onthe neural network for classification.

1. Use a two-layer neural network.2. Use a linear output function for all neurons.3. Setu to be a .n

4. Setw 5w , wherej andk represent any of theij ik

neurons in the second layer.5. Give the training pattern that ignitesn number

of output neurons for then-graded classifica-tion.

Fig. 13. Excessive fitting. Even measurement errors may beincorporated as normal input data. It is, therefore, understood that the operation of the

Page 11: Hierarchy neural networks as applied to pharmaceutical problems

H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147 1129

ALS method is simply a special case of the neural constant 1. It should be emphasized here that addi-network: linear classification using a two-layer neur- tion of the constant 1 to the input data means that theal network. The details will be given elsewhere[34]. optimization ofu in Eq. (6) is carried out throughj

the weight matrix (W ). The neural network with Eq.ij

(18) performs the linear operation equivalent that of5 .2. The multiregression analysis [35]multiregression analysis. In order to exceed thislevel, it is necessary to introduce a nonlinear opera-Here, we describe the relationship between thetion in the network. This is possible by incorporatingoperation of the neural network and the multiregres-the hidden layers. By lettingO 5 y and using Eq.sion analysis. For simplification, let us consider a j j

(16) for the last layer, a generalized nonlinearthree-1ayer network. Since the operation expressedmultiregression analysis is established. However, theby Eq. (6) results in vector elements that are toolarger number of the neurons in the second layerclose to 0 or 1, Eq. (6) is not very suitable when it ismust be adopted, rather than the input layer, to avoidapplied to the situations where the values between 0loss of the information that the input pattern has. Inand 1 are important. Therefore, we considered a newconsequence, the operation of the three-layer neuraloperation equation. Without losing generality, onenetworks is said to be nonlinear. Since the linearcan omitu in Eq. (6), givingj

operation is included as a special case, the neuraly 5OW x (16)j ij i network is predicted to work far better than the ALS

imethod and the multiregression analysis in both

Namely classification and fitting.

y 5Wx (2)

whereW and x are the weight matrix and the input 6 . How to overcome defects of ANN’s operationvector, respectively. Thus, if all neurons of each and purposive extensionlayer are governed by Eq. (16), i.e.

Nothing can beat ANN as classification and fittingy 5W xz 5W y (17)1 2 machines. This is because ANN’s operation is non-then the output pattern,z, becomes linear. However, the excessive nonlinear operation

causes some inconvenience. Then, the informationz 5 (W W )x 5Wx (18)1 2 processing in ANN is parallel. The parallel process-

ing makes it difficult to trace the flow of informationwhereW andW are the matrices which express thel 2in the network and does not give the reason whyweights between the layers 1 and 2 and thosesuch a decision was made. This section discussesbetween the layers 2 and 3, respectively.those problems.The method of the multiregression analysis seeks

the optimal coefficients of the linear equation6 .1. How to deal with excessive nonlinear

z 5 a 1Ob x (19) operationi i i i

wherez and x are, respectively, the elements of theThe neural network based on Eqs. (2)–(4) per-expectation vector and input data. Eq. (19) is equiva-

forms a nonlinear operation. As discussed, a non-lently rewritten aslinear operation is not always convenient in practical

z 5B0(11 x) (20) application and, therefore, it may be preferable if alinear operation can be introduced into the neuralnetwork. Although this is possible by adopting aEq. (18), a special case of the neural network’ssmallera value for the sigmoid function, a (mathe-operation, shows that the operation is equivalent tomatically) simple way is to define a new activationthat of the two-layer network and to that of afunction as a combination of the sigmoid functiongeneralized multiregression analysis if the variablesand the linear function asare so set asx to be the observed values plus the

Page 12: Hierarchy neural networks as applied to pharmaceutical problems

1130 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

(3)O 5bh(y )1 (12b )y 5 f( y ) (21)j j j j ≠Oj (1,2) (2,3)]]5Of0(y )W g0(y )W (26)k ik j kj≠xi kHere, parameterb expresses the mixing degree ofthe linear operation to the nonlinear operation and, where f9 and g9 are, respectively, the differentialtherefore, by changingb, one can pour the linear functions of the activation function in the second andoperation into the network at any level. Ifb is set at third layers while the superscript onW expresses the0, the network can be expected to perform the linear layer’s order. The expression using Eq. (21) as theoperation correctly. In practice, however, a problem activation function turns out to bearises: if b is set close to 0, the neurons in the

(3)≠Ojsecond layer easily exceed the defined region (0–1), (2) (2)]]5O b a h(y )h12 h(y )jf k k≠xresulting in a destruction of learning. This difficulty i k

can be removed by introducing a concept of ‘neuron (2) (1,2)1 (12b ) Wg ikfatigue’ into the usual backpropagation learning

(3) (3)3 b a h(y )h12 h(y )j1 (1method[39]. Our experiences tell us that the smallest f j j

value of b is around 0.5 by the learning procedure (3) (2,3)2b ) g0(y )W (27)g j kjwithout the ‘neuron fatigue’ procedure.

The training of the network is similarly carried out Note that whenb is near 1, the partial derivative,based on Eqs. (7) and (8) until the sum of the ≠O /≠x , approaches 0 as the output value nears 0 orj i2squared errors,S(O 2 t ) [Eq. (11)], becomes small 1. This character stems from the sigmoid function.j j

enough. If the new activation function is adopted, thederivative function, f9() or g9() in Eq. (8) is 6 .2.1. Accuracy of the partial derivatives obtained

by the neural networkf0(y )5abh(y )[12 h(y )] 1 12b (22)i i i Unlike the linear multiregression analysis, one

cannot find the analytical method to determine theIn the above equations, both́ and a, and evenb,reliability of the results by the neural network. Wecan be set independently of the layer.may, therefore, discuss the tendencies of the fittingcurve or separation surface by using concrete nu-6 .2. How to obtain partial differential coefficientsmerical data, although such a tendency is not a proofby the neural network [39,40]of reliability. Table 1 shows the analytical and

2calculated derivatives fory52x andy5x functionsSince the operation of the hierarchy-type neuraltogether with the reproduced values by the network.networks is completely defined by the mathematicalThe network structure wasN(2,10,1), where, as aformula, it is possible to take the partial derivative ofrule, one of the neurons in the first layer was used asan output to any input parameters. Sincea bias. The 21 points, 0, 1, 2,. . . , 20,were used to

(1,2) train to simulate they52x function and the 21dy 5W dx (23)i ij ipoints, 210, 29, . . . ,21, 0, 1, . . . , 9, 10, for

2and y5x .One may understand that the network well re-

dO 5 f0(y )dy (24)i i i produced the function’s value. Except for the termi-nal points, the calculated derivatives are within thethe partial derivative of the output in the seconderror |5%, and such errors increase at the terminallayer becomesand extreme points. According to our experience,

(2) this is observed very generally and must be regarded≠Oj (1,2)]]5 f0(y )W (25) as a defect by the nonlinear fitting.j ij≠xiTable 2shows the derivative in case of classifica-

tion. We used they52x110 function; 12 equallyLikewise, the partial derivatives of the output in thedivided points of the region from (0,10) to (10,0) onthird layer with respect to an input parameter isthe line was selected and points from (0,10) togiven by

Page 13: Hierarchy neural networks as applied to pharmaceutical problems

H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147 1131

T able 1aCalculated derivatives of simple functions by the neural network

2y 52x y 5 x

9 9x y y9 x y y ycalc calc theort

0 0.14 1.93 210 99.97 218.00 2201 2.08 1.94 29 82.14 217.53 2182 4.03 1.96 28 65.15 216.33 2163 5.99 1.97 27 49.67 214.55 2144 7.96 1.98 26 36.18 212.40 2125 9.94 1.99 25 24.93 210.10 2106 11.93 1.99 24 15.98 27.83 287 13.93 2.00 23 9.24 25.68 268 15.93 2.01 22 4.56 23.70 249 17.94 2.01 21 1.80 21.84 22

10 19.95 2.02 0 0.85 20.07 011 21.97 2.02 1 1.67 1.71 212 23.99 2.02 2 4.30 3.57 413 26.01 2.02 3 8.85 5.56 614 28.02 2.02 4 15.49 7.74 815 30.04 2.91 5 24.38 10.06 1016 32.05 2.01 6 35.64 12.45 1217 34.05 2.00 7 49.24 14.72 1418 36.05 1.99 8 64.95 16.62 1619 38.03 1.97 9 82.27 17.89 1820 40.00 1.96 10 100.45 18.32 20

a 25The network structure wasN(2,10,1), wherea54, b51, and the threshold of the convergence,10 in terms of the scaled units.

(4.17,5.83) were trained as class 1, other points asT able 2 class 2. The network structure wasN(3,10,2). As theaDerivatives in classification

center (5,5) is the inflection point, derivatives aroundPoint Variable the center appear to be steep gradients, while those

x y around the trained points are zeros. The results arereasonable and satisfactory.1 (0,10) 0.00 (0.00) 0.00 (0.00)

2 (0.83,9.17) 0.00 (0.00) 0.00 (0.00)3 (1.67,8.33) 0.00 (0.00) 0.00 (0.00) 6 .2.2. Is the independency between input neurons4 (2.50,7.50) 0.00 (0.00) 0.00 (0.00) kept?5 (3.33,6.67) 20.03 (0.03) 0.03 (20.03) When the input parameters are independent of6 (4.17,5.83) 20.75 (0.74) 0.75 (20.74)

each other, one may wonder whether or not the7 (5.83,4.17) 20.74 (0.73) 0.74 (20.73)decision or derivative given by the neural network is8 (6.67,3.33) 20.04 (0.03) 0.04 (20.03)

9 (7.50,2.50) 0.00 (0.00) 0.00 (0.00) influenced by the values of other input neurons. This10 (8.33,1.67) 0.00 (0.00) 0.00 (0.00) should be severely checked if one discusses the11 (9.17,0.83) 0.00 (0.00) 0.00 (0.00) interpolated values, since they are not the trained12 (10,0) 0.00 (0.00) 0.00 (0.00)

points. To determine such independency we intro-a The network structure wasN(3,10,2), wherea51, b51, and duced dummy input neurons by which the predic-23the threshold of the convergence,10 in terms of the scaled

tions and decisions are checked. The network struc-units. Points from 1 to 6 were used to train as class 1 while thoseture wasN(3,10,2).from 7 to 12, as class 2. The values in parentheses are the

derivatives for class 2. This time the 21 points with the dummy neuron

Page 14: Hierarchy neural networks as applied to pharmaceutical problems

1132 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

T able 3aFluctuation of predicted function’s values and their derivatives by change of the dummy neuron’s value iny52x

Function’s values Derivativesb cx 210 25 0 5 10 210 25 0 5 10

0.5 1.17 0.67 0.44 0.59 1.19 1.86 1.90 1.90 1.88 1.831.5 3.05 2.58 2.36 2.49 3.04 1.89 1.93 1.94 1.92 1.872.5 4.95 4.53 4.32 4.42 4.93 1.92 1.97 1.98 1.95 1.903.5 6.89 6.52 6.31 6.39 6.85 1.95 2.00 2.01 1.99 1.944.5 8.85 8.53 8.34 8.40 8.80 1.97 2.03 2.04 2.02 1.965.5 10.83 10.57 10.40 10.43 10.77 1.99 2.05 2.07 2.05 1.996.5 12.83 12.64 12.48 12.50 12.77 2.01 2.07 2.10 2.07 2.017.5 14.85 14.72 14.59 14.58 14.79 2.03 2.09 2.12 2.09 2.038.5 16.88 16.82 16.71 16.68 16.83 2.04 2.11 2.13 2.11 2.049.5 18.92 18.93 18.84 18.79 18.87 2.04 2.11 2.14 2.12 2.05

10.5 20.97 21.04 20.99 20.91 20.92 2.05 2.12 2.15 2.12 2.0511.5 23.02 23.16 23.13 23.04 22.98 2.05 2.12 2.15 2.13 2.0612.5 25.07 25.28 25.28 25.16 25.03 2.04 2.12 2.14 2.12 2.0513.5 27.11 27.39 27.42 27.28 27.08 2.04 2.11 2.13 2.11 2.0514.5 29.14 29.49 29.55 29.39 29.12 2.03 2.09 2.12 2.10 2.0315.5 31.26 31.58 31.66 31.48 31.15 2.01 2.08 2.10 2.09 2.0216.5 33.16 33.64 33.75 33.56 33.16 1.99 2.05 2.08 2.06 2.0017.5 35.14 35.68 35.82 35.61 35.15 1.97 2.03 2.06 2.04 1.9818.5 37.09 37.70 37.86 37.64 37.11 1.94 2.00 2.02 2.01 1.9519.5 39.02 39.68 39.87 39.63 39.05 1.91 1.97 1.99 1.98 1.92

a The network structure wasN(3,10,1), where one of neuron in the first layer was use as the dummy input. Other network parameters andthe convergence condition were the same as those inTable 1.

b Predicted point.c Indicates that the value of210 was input to the dummy input.

were trained at the same time, in which the value of near the inflection point the fluctuation tends to bethe dummy neuron was changed210, 25, 0, 5, and large.10. Namely, the 21 input data with210 for the The results for the classification are shown indummy neuron, the same 21 input data plus25 for Table 5 where only two points, (0,10) and (10,0),the dummy neuron, and so on. By doing so, one can were trained. As one can see, the fluctuations in bothteach the network that the dummy neuron is in- cases are small enough to say that the independencydependent of the decision. Then intermediate points is virtually maintained.between the training points were predicted by the From the above results, it may be generally saidnetwork, where various values for the dummy neu- that if the independency of input parameters isron 210, 25, 0, and 10, were input to see the properly incorporated in the network by training, theinfluence of the dummy neuron.Table 3shows the predicted values and their derivatives are not in-results. The predicted values are not very accurate at fluenced by the values of other input parameters.the end, and fluctuations by the dummy neuron However, a small amount of fluctuation appearsbecome large near both ends, especially at the low at /near the terminal and extreme points.end.

2Table 4 shows the case of they5x function. 6 .2.3. Isolation of functions out of the mixedSince the neural network is good at treating the functionsnonlinear correlation, the fluctuation in both the It is possible to take the partial derivatives of thepredicted and derivative values is smaller than that of output strength with respect to each input parameter.linear relation. However, one can see that as a rule, This means that isolation of individual linear func-

Page 15: Hierarchy neural networks as applied to pharmaceutical problems

H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147 1133

T able 42 aFluctuation of predicted function’s values and their derivatives by change of the dummy neuron’s value iny5x

bx Predicted values Derivativesc

210 25 0 5 10 210 25 0 5 10

29.5 90.93 90.89 90.76 90.66 90.69 217.89 217.98 218.00 217.96 217.8728.5 73.44 73.32 73.19 73.14 73.25 216.99 217.04 217.04 216.99 216.9227.5 57.17 57.01 56.89 56.88 57.05 215.47 215.49 215.47 215.43 215.3926.5 42.65 42.49 42.39 42.41 42.60 213.51 213.50 213.48 213.45 213.4425.5 30.24 30.09 30.02 30.05 30.24 211.31 211.28 211.25 211.24 211.2524.5 20.06 19.96 19.91 19.95 20.12 29.04 29.00 28.97 28.07 29.0023.5 12.14 12.08 12.05 12.09 12.22 26.83 26.78 26.76 26.76 26.8122.5 6.37 6.36 6.35 6.38 6.46 24.73 24.68 24.66 24.68 24.7321.5 2.63 2.67 2.68 2.69 2.72 22.76 22.72 22.70 22.72 22.7820.5 0.81 0.89 0.91 0.90 0.87 20.89 20.85 20.84 20.87 20.93

0.5 0.84 0.96 0.98 0.94 0.85 0.93 0.97 0.97 0.94 0.881.5 2.68 2.84 2.87 2.79 2.64 2.77 2.80 2.80 2.77 2.722.5 6.40 6.59 6.61 6.51 6.31 4.68 4.71 4.71 4.69 4.633.5 12.08 12.31 12.33 12.21 11.96 6.71 6.75 6.75 6.73 6.684.5 19.86 20.12 20.16 20.02 19.73 8.87 8.92 8.93 8.92 8.885.5 29.86 30.17 30.23 30.07 29.75 11.14 11.19 11.22 11.21 11.196.5 42.15 42.52 42.60 42.45 42.11 13.43 13.49 13.52 13.52 13.517.5 56.67 57.09 57.21 57.07 56.72 15.57 15.63 15.67 15.68 15.678.5 73.16 73.65 73.81 73.68 73.33 17.34 17.41 17.45 17.46 17.469.5 91.16 91.71 91.90 91.80 91.44 18.53 18.59 18.63 18.65 18.65a The network structure wasN(3,10,1), where one of neuron in the first layer was use as the dummy input. Other network parameters and

the convergence condition were the same as those inTable 1.b Predicted point.c Indicates that the value of210 was input to the dummy input.

T able 5aFluctuation of predicted values and derivatives by change of the dummy neuron’s value in classification

bPoint Predicted values Derivativesc10 5 0 25 210 10 5 0 25 210

1 (0,10) 0.978 0.978 0.978 0.977 0.977 0.065 0.065 0.064 0.064 0.0642 (0.83,9.17) 0.988 0.966 0.966 0.966 0.965 0.118 0.117 0.116 0.115 0.1143 (1.67,8.33) 0.944 0.944 0.944 0.944 0.943 0.223 0.220 0.218 0.215 0.213

0.4 (2.50,7.50) 0.901 0.902 0.902 0.902 0.902 0.428 0.422 0.417 0.412 0.4075 (3.33,6.67) 0.821 0.823 0.823 0.824 0.825 0.784 0.774 0.765 0.756 0.7486 (4.17,5.83) 0.686 0.687 0.689 0.691 0.692 1.227 1.219 1.211 1.202 1.1927 (5.83,4.17) 0.500 0.502 0.504 0.506 0.508 1.446 1.448 1.449 1.449 1.4488 (6.67,3.33) 0.315 0.316 0.318 0.819 0.320 1.209 1.219 1.229 1.238 1.2469 (7.50,2.50) 0.180 0.180 0.180 0.181 0.182 0.766 0.755 0.784 0.793 0.802

10 (8.33,1.67) 0.100 0.100 0.100 0.100 0.100 0.419 0.424 0.429 0.434 0.43911 (9.17,0.83) 0.058 0.057 0.057 0.057 0.057 0.220 0.221 0.224 0.226 0.22812 (10,1) 0.035 0.035 0.034 0.034 0.034 0.117 0.118 0.118 0.119 0.1201 (0,10) 0.023 0.023 0.022 0.022 0.022 0.065 0.066 0.066 0.066 0.066

a Only two points, (0,10) and (10,0) on thex–y plane were used to train as classes 1 and 2, respectively. The network structure wasN(4,19,2). One of the first ayer neurons was used as the dummy input. Other network parameters and convergence conditions were the sameas those inTable 2.

b Predicted point.c Indicates that the value of210 was input to the dummy input.

Page 16: Hierarchy neural networks as applied to pharmaceutical problems

1134 H. Ichikawa / Advanced Drug Delivery Reviews 55 (2003) 1119–1147

T able 6 T able 8aExample of input and training data to isolate individual functions Recognition test inx 1 ay

out of their combined functiona 1 2 Ratio

Sample Input data Training data0.6 0.982 0.595 0.606

no. x12yx y 0.7 0.991 0.689 0.695

0.8 0.989 0.795 0.8041 8.5 0.6 9.7

0.9 0.998 0.895 0.8972 0.1 2.0 4.1

a3 9.3 4.1 17.5 The network structure wasN(3, 10, 1), wherea51, b51,254 0 0.6 1.2 and the threshold of the convergence,10 in terms of the scaled

units.

tions out of the mixed function is possible. Weperformed this separation test to figure out the according to the fed random numbers, one can seeminimum sample number of practical isolation ac- the neural network can detect the individual func-cording to the following procedures. For example, let tions by a relatively small number of data. Such aus consider the isolation anx and 2y out of thex12y number depends on the complexity of the mixedfunction. As shown in Table 6, using random function. In a simple relationship likex12y, onlynumbers forx and y which are input patterns, the five sets of data seems to be enough to separate

2values for x12y are obtained and used as the them. However, in the rather complexx2y1ztraining pattern. By obtaining the partial derivatives, function, 15 sets of random numbers were needed.one can know the relationship,x12y. The number ofsample data averages the derivatives. 6 .2.4. Recognition of two similar functions

Table 7 shows the results forx12y, x 2y, x1 As already mentioned, the neural network easily2 22y13z, x1y , and x2y1z , where the values in recognizes the two functions,x12y. We then ex-

parentheses show the maximum deviations from the amined the limit of such separation ability using themean values. Although these derivatives may vary function,x1ay, where a was changed from 0.6 to

T able 7aIsolation of individual functions out of the mixed function

5 10 15 20 30

x12yx9 0.953 (0.188) 0.979 (0.121) 0.977 (0.143) 0.985 (0.519) 0.989 (0.141)y9 1.946 (0.273) 1.953 (0.137) 1.949 (0.161) 1.949 (0.134) 1.970 (0.136)

x–yx9 1.433 (0.388) 0.972 (0.077) 0.973 (0.093) 0.985 (0.066) 0.984 (0.011)y9 20.674 (0.197) 20.962 (0.085) 20.976 (0.101) 20.981 (0.086) 20.996 (0.126)

x12y13zx9 0.848 (0.411) 0.979 (0.254) 0.968 (0.200) 0.989 (0.254) 1.001 (0.168)y9 1.636 (0.690) 1.928 (0.230) 1.949 (0.260) 1.931 (0.339) 1.980 (0.301)z9 3.263 (1.269) 2.915 (0.558) 2.962 (0.460) 2.921 (0.441) 2.960 (0.410)

2x 1 yx9 1.646 (4.014) 1.111 (0.528) 1.055 (0.081) 1.007 (0.109) 1.009 (0.072)

2x–yx9 1.656 (3.953) 0.963 (0.622) 1.073 (0.442) 1.010 (0.446) 0.994 (0.366)

2x 2 y 1 zx9 20.350 (1.052) 0.449 (0.370) 1.065 (0.188) 1.008 (0.315) 1.051 (0.157)y9 0.005 (0.965) 20.406 (0.343) 20.936 (0.430) 21.025 (0.169) 21.009 (0.193)a 25The network structure wasN(n,10,1); a54, b51, and the convergence condition,10 in terms of the scaled units. Averaged

derivatives are shown. The values in parentheses are the maximum deviations from the mean values.


6.2.4. Recognition of two similar functions
As already mentioned, the neural network easily recognizes the two functions in x + 2y. We then examined the limit of such separation ability using the function x + ay, where a was changed from 0.6 to 0.9. In order to overcome the problem of the scarcity of data, we used 40 sets of random numbers independently for x and y. The results are shown in Table 8. It is surprising that, if the data are sufficient, the neural network can distinguish the two functions, x and 0.9y, out of the function x + 0.9y.

6.2.5. Recognition of two similar functions by correlated input data
So far, we have used completely random numbers independently for the variables x, y, or z. Actual data for quantitative structure-property relationship (QSPR) analysis, however, have some kind of correlation among them. It is, therefore, necessary to determine the relationship between the separability and the dispersion among the input data. To this end, we again used the x + ay function, where the input data for x and y were correlated in the following way. The similarity of a data set is measured using the dispersion, σ² [= Σ(x_i - y_i)²/N, where N is the total number of data sets (= 40)]. The correlation was made by discarding a set of data (x_i, y_i) if (x_i - y_i)² > φ. If φ is chosen to be 0.82, 0.59 or 0.37, σ² is approximately 0.2, 0.1 or 0.05, respectively. The cases with the coefficient a being 1.1, 1.3 and 2.0 were examined. Table 9 shows the results. When a is ≥ 1.3, the neural network can distinguish the two functions even at σ² being 0.05. However, when a = 1.1, the separation cannot be carried out unless σ² is > 0.1.
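As a rough illustration of how such correlated input data can be produced, the short sketch below keeps only those random pairs whose squared difference does not exceed the threshold φ and then reports the resulting dispersion σ² for the 40 retained sets. The 0-10 sampling range, the rejection loop, the random seed and the function name correlated_sets are assumptions made for this example, not a description of the author's actual procedure.

import numpy as np

def correlated_sets(phi, n_sets=40, a=1.1, seed=1):
    # keep a random pair (x, y) only if (x - y)^2 <= phi, as described above
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    while len(xs) < n_sets:
        x, y = rng.uniform(0.0, 10.0, size=2)     # sampling range is an assumption
        if (x - y) ** 2 <= phi:
            xs.append(x)
            ys.append(y)
    x = np.array(xs)
    y = np.array(ys)
    sigma2 = np.mean((x - y) ** 2)                # dispersion sigma^2 = sum((x-y)^2)/N
    return x, y, x + a * y, sigma2                # inputs, training pattern, dispersion

for phi in (0.82, 0.59, 0.37):                    # threshold values quoted in the text
    _, _, _, s2 = correlated_sets(phi)
    print(f"phi = {phi:.2f}  ->  sigma^2 = {s2:.3f} for the retained sets")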

6.2.6. Application of the partial derivative method [40]
The recognition tests of individual functions out of the combined function indicate that the recognition ability of the hierarchy neural network seems to be powerful, in fact far better than we had expected, although the reliability is not given in terms of any mathematical formula. As an example of application, we used a rather standard problem: the relationship between the 13C NMR chemical shift and the configuration of the substituent in norbornanes or norbornenes. This problem is the so-called Kowalski problem [40] and has been frequently quoted for recognition examinations.

Table 9
Recognition of two similar linear functions by correlated input data^a

                        Trial
                        1       2       3       4       5       6
σ² = 0.05 (a = 1.1)
  x                     1.009   1.058   1.034   1.082   1.103   1.112
  y                     1.079   1.034   1.052   1.005   0.981   0.981
σ² = 0.1 (a = 1.1)
  x                     1.022   1.025   0.953   0.976   0.977   1.022
  y                     1.070   1.064   1.141   1.115   1.105   1.072
σ² = 0.2 (a = 1.1)
  x                     1.009   1.039   0.988   0.977   1.012   0.978
  y                     1.082   1.055   1.105   1.115   1.074   1.104
σ² = 0.05 (a = 1.3)
  x                     1.006   1.082   1.110   1.120   1.115   1.067
  y                     1.281   1.209   1.174   1.116   1.168   1.225
σ² = 0.05 (a = 2.0)
  x                     1.085   1.106   1.153   1.178   1.130   1.230
  y                     1.90    1.881   1.829   1.807   1.845   1.764

a All conditions concerning the network are as in Table 8.

Table 10
Derivatives of norbornane and norbornene^a

Substituent                       Compound no.
                                  Exo     Endo
CH3                               1       14
NH2                               2       15
OH                                3       16
COOH                              4       17
CH2OH                             5       18
CH3, =O(3)^b                      6       19
CH3, =O(5)                        7       20
CH3, F2(6)                        8       21
CH3, 5=6^c                        9       22
OH, 5=6                           10      23
CH3, CH3(4)                       11      24
CH3, =CH2(3)                      12      25
OH, CH3(1), (CH3)2(7)             13      36
CN                                26      27
COOCH3                            28      29
CH3, =O(6)                        30      31
CH3, F2(5)                        32      33
CH2OH, 5=6                        34      35
Cl, CH3(1), (CH3)2(7)             37      38

a The data are quoted from the literature [41].
b Indicates that the attached substituent is at position 3.
c Indicates that the double bond is between positions 5 and 6.


Table 11
Relative 13C NMR chemical shifts and conformations in norbornanes and norbornenes^a

Compound   C1     C2     C3     C4     C5     C6     C7     Exo/endo
1          6.7    6.7    10.1   0.5    0.2    -1.1   -3.7   Exo
2          8.9    25.3   12.4   -0.4   -1.2   -3.1   -4.4   Exo
3          7.7    44.3   12.3   -1.0   -1.3   -5.2   -4.4   Exo
4          4.6    16.7   4.4    -0.2   -0.3   -1.0   -1.8   Exo
5          1.8    15.1   4.4    -0.2   0.2    -0.7   -3.3   Exo
6          5.7    3.0    2.6    -0.5   -0.4   0.7    -3.5   Exo
7          6.1    5.9    10.6   0.6    0.2    0.2    -3.7   Exo
8          6.5    6.3    10.4   0.3    -0.8   -0.1   -3.5   Exo
9          6.5    7.5    9.5    0.5    1.7    0.7    -3.8   Exo
10         7.8    47.0   11.7   -1.3   3.9    -2.7   -3.2   Exo
11         6.9    6.4    10.1   0.7    -1.2   0.1    -3.9   Exo
12         5.6    4.9    7.0    0.2    -1.1   0.2    -3.9   Exo
13         2.5    42.5   11.9   -0.8   -1.1   -2.4   1.4    Exo
14         5.4    4.5    10.6   1.4    0.5    -7.7   0.2    Endo
15         6.8    23.3   10.5   1.2    0.6    -9.5   0.3    Endo
16         6.3    42.4   9.5    0.9    0.2    -9.7   -0.9   Endo
17         4.2    16.2   2.1    0.9    -0.6   -4.8   1.9    Endo
18         1.7    12.8   4.0    0.4    0.2    -7.2   1.4    Endo
19         4.7    3.1    2.2    0.3    1.3    -6.5   -0.6   Endo
20         4.7    5.3    9.2    1.3    -0.4   -6.5   1.4    Endo
21         4.6    11.5   8.9    -0.1   0.8    0.4    1.8    Endo
22         5.6    7.5    8.7    1.4    1.7    -3.0   1.7    Endo
23         7.1    47.8   13.3   2.2    3.6    -3.4   0.6    Endo
24         4.1    1.2    7.0    0.7    0.5    -7.4   0.0    Endo
25         3.2    40.2   10.4   -0.5   0.0    -10.3  3.1    Endo
26         5.5    1.0    6.3    -0.3   -1.5   -1.6   -1.3   Exo
27         3.4    0.1    5.5    0.2    -0.7   -4.9   1.0    Endo
28         5.1    16.4   4.2    -0.4   -1.1   -1.4   -2.1   Exo
29         4.0    15.9   2.2    0.7    -0.7   -5.0   1.7    Endo
30         6.6    7.0    10.1   0.2    -1.2   0.5    -3.7   Exo
31         6.0    8.4    11.2   -0.1   0.7    -1.5   -1.6   Endo
32         6.3    7.2    9.8    0.7    -0.1   0.8    -3.5   Exo
33         5.1    4.8    8.4    1.1    -0.1   -7.3   1.6    Endo
34         1.9    17.1   5.2    -0.1   0.9    0.9    -3.4   Exo
35         2.3    18.3   5.0    0.3    1.3    -2.9   1.4    Endo
36         5.1    4.0    8.4    1.1    0.2    -7.7   1.6    Endo
37         2.9    30.3   13.4   -0.5   -2.1   -0.7   2.0    Exo
38         3.7    29.8   10.8   -1.6   -1.1   -9.0   2.2    Endo

a The data are quoted from the literature [41].

Shown in Tables 10 and 11 are the endo/exo configurations and the relative 13C NMR chemical shifts in the derivatives of norbornane and norbornene, quoted from the literature [41], where the same compound numbers are used. In accordance with the former studies, we used 25 (nos. 1-25) out of the 38 data as training data. The network structure was set to be N(8,14,2).

Table 12 shows the averaged partial derivatives. The absolute values for each set of parameters are nearly the same, but the signs are opposite to each other. This means that each input parameter contributes oppositely to the exo/endo decision and that its magnitude is the same.


Table 12
Correlation analysis between the 13C NMR chemical shifts and the configuration^a

Configuration   C1       C2       C3       C4       C5       C6       C7
Exo             -0.009   0.088    0.031    -0.111   -0.107   0.141    -0.213
Endo            0.001    -0.088   -0.032   0.110    0.109    -0.141   0.214

a The network structure was N(8,14,2), where a = 1, b = 1, and the threshold of the convergence was 10^-3. Averaged values are shown.

The absolute values for parameters 1 and 3 are negligibly small, so these parameters have almost nothing to do with the exo/endo decision. The largest absolute values are found in parameter 7, showing that this parameter has the major contribution to the decision. These results are in good accord with chemical experience. The merit of the present method, however, is that one can quantitatively handle the degree of the contribution of each input parameter.

6.3. Reconstruction learning [34]

As we study the operation of such neural networks, we are surprised to see how the behavior of the neural network resembles that of the brain. In the neural network, the information which is accumulated in the learning phase is kept as the strength of the connections between the neurons, recorded in the weight matrices. As our experience tells us, we learn things repeatedly through the processes of learning and forgetting. The memory that is obtained through such processes is settled firmly in the mind. We considered that these processes might be incorporated in the hierarchy neural networks, and when this is done, it is interesting to know what happens to the connections of the neurons.

It has been shown that the weight matrices are not unique even if the network gives the same results [35]. This indicates that various kinds of reconstruction of the weight matrices are possible. Therefore, we tried to introduce the procedures of both the learning and forgetting processes into the learning phase of the neural network. The weight matrices thus obtained are called 'reconstructed matrices'. The reconstructed matrices were surprising and suggestive. They were found to be widely applicable in finding the active neurons of the network and could serve in the analysis of the relationship between the input and output data.

6.3.1. Introduction of the forgetting procedure into the learning phase

The training is carried out according to the usual backpropagation algorithm until the error function [Eq. (11)] becomes small enough. Suppose M sets of input and training patterns are given; all of the output patterns can be made close enough to the training patterns by iteration through Eqs. (7) and (8). If convergence is attained, the neural network has the ability to classify the input patterns into M groups.

Here we consider a procedure in which the absolute values of the weight matrices are lessened by the equation

W_ij = W_ij - sgn(W_ij) ζ {1 - D(W_ij)}        (28)

where D is a function that gives 1 at |W_ij| < ζ and 0 at |W_ij| > ζ, and ζ is set to about a tenth of ε as an initial value and is varied so as not to greatly change E [Eq. (11)], i.e. not to greatly change the decision by the network. If this procedure is applied to the network, some of the information which is given by training is partly erased. This corresponds to forgetting in memory and is termed 'erasing'. We call the training procedure for the M sets of data a 'training cycle'. If, after a training cycle is carried out, the erasing procedure is applied to the same network, the information which was accumulated in the training cycle is partly lost from the network. Remarkably, we discovered that these contradictory procedures do not affect all connections equally. Some connections are affected by the training cycle more strongly than by the erasing procedure, giving stronger connections, while others are affected more strongly by the erasing procedure, giving weak or null connections. Therefore, the information accumulated between neurons can be reconstructed without changing the contents of the information originally embodied in


the network. This series of procedures was termed 'reconstruction-learning' [34].

Reconstruction-learning often reveals the role of each neuron and gives characteristic connections between the neurons; if one traces the connections between the input and output neurons, one can understand the role of the input parameters in the decision or the output intensity.

6.3.2. How does reconstruction learning work?

As a simple example, we show the case in which the relationship y = x² is trained with three dummy neurons. Thus, the network structure is N(4,4,1) with a being 4. The data were as follows: the values 0, 1, 2, . . . , 10 were fed to neuron 1 of the first layer. For each input value, the training data were the corresponding squares, 0, 1, 4, 9, . . . , 100. One-digit random numbers were fed to neurons 2 to 4 of the first layer.

Table 13A shows the results obtained by the usual backpropagation method, while Table 13B shows those obtained by the reconstruction learning method. In both tables, the numbers in the first line indicate the neuron number of the first layer and those in the first column, the second layer.

Table 13
Backpropagation learning (A) and reconstruction learning (B)

         1        2        3        4
(A)
1        1.357    -0.116   0.061    0.040
2        -0.511   -0.231   0.350    0.339
3        0.316    -0.114   -0.786   0.018
4        0.206    0.086    -0.271   0.018
(B)
1        1.217    0.002    0.003    0.009
2        0.000    0.000    0.000    0.000
3        0.000    0.000    0.000    0.000
4        0.000    0.000    0.000    0.000

Numbers in the top line are those of neurons in the first layer, while the numbers in the left column are those of neurons in the second layer.

It is seen that the connections obtained by the backpropagation learning involve all neurons between the first and the second layers. On the other hand, in the reconstruction learning method, most connections are null: the surviving connections are those between both first neurons and those between neurons 2-4 of the first layer and the first neuron of the second layer, but the latter values are negligibly small. This demonstrates that to simulate the y = x² function, the second layer needs only one neuron and the first layer essentially needs only one neuron. It is understandable why the small values of the connections starting from neurons 2-4 of the first layer exist: the neural network expands the y = x² function by using the sigmoid functions. Since only one sigmoid function cannot completely adapt itself to the y = x² function, some compensation is necessary to accurately express the function.
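To make the alternation of training and erasing concrete, the following minimal sketch applies the erasing rule of Eq. (28) to a toy weight matrix. It is an assumption made for illustration, not the author's implementation: the matrix size, the value of ζ, the number of cycles and the helper name erase are arbitrary, and the ordinary backpropagation training cycle that would normally precede each erasing step is only indicated by a comment.

import numpy as np

def erase(W, zeta):
    # Eq. (28): W_ij <- W_ij - sgn(W_ij) * zeta * (1 - D(W_ij)),
    # where D(W_ij) = 1 if |W_ij| < zeta and 0 otherwise, so weights larger
    # than zeta in magnitude shrink toward zero by zeta and smaller ones
    # are left untouched (their sign cannot flip).
    D = (np.abs(W) < zeta).astype(float)
    return W - np.sign(W) * zeta * (1.0 - D)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(4, 4))      # a toy 4 x 4 weight matrix
zeta = 0.05                                # e.g. about a tenth of the learning parameter

for cycle in range(50):
    # train_one_cycle(W, patterns)  # ordinary backpropagation over the M patterns
    #                               # would be carried out here before each erasing
    W = erase(W, zeta)

print(np.round(W, 3))                      # with no retraining, every weight has
                                           # decayed to below zeta in magnitude

When real training cycles are interleaved as in the text, the connections that the data keep reinforcing withstand the repeated erasing and survive, while the others decay toward zero, which is the kind of localization displayed in Tables 13B and 15.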

Table 14
Matrices by the backpropagation learning^a

         1       2       3       4       5       6       7       8       9       10      11      12      13      14

(A) Weight matrix between the first- and second-layer neurons^b
1(C1)    -0.022  0.133   -0.188  -0.147  0.047   0.063   -0.009  0.112   -0.044  -0.214  -0.117  -0.112  0.092   -0.221
2(C2)    0.219   0.203   0.240   -0.040  0.108   0.265   -0.355  -0.483  0.333   -0.054  -0.099  0.230   -0.358  0.011
3(C3)    0.321   -0.043  -0.013  -0.173  -0.141  0.083   -0.143  -0.186  0.156   -0.035  -0.029  0.097   -0.182  0.128
4(C4)    -0.233  -0.149  -0.273  -0.192  0.205   -0.559  0.489   0.451   -0.339  -0.023  0.043   -0.350  0.445   -0.242
5(C5)    -0.406  -0.095  -0.097  0.120   0.093   -0.261  0.367   0.324   -0.429  -0.149  0.318   -0.288  0.381   0.067
6(C6)    0.449   0.068   0.421   0.006   -0.095  0.664   -0.760  -0.747  0.524   0.180   -0.251  0.358   -0.535  0.276
7(C7)    -0.773  -0.381  -0.495  -0.086  -0.038  -0.716  0.905   0.999   -0.654  -0.235  0.583   -0.278  0.615   -0.404

(B) Weight matrix between the third- and second-layer neurons^c
1(exo)   0.751   0.221   0.624   0.149   -0.217  0.747   -0.913  -0.998  0.692   0.188   -0.437  0.431   -0.655  0.467
2(endo)  -0.732  -0.374  -0.387  0.038   -0.078  -0.876  0.970   0.956   -0.746  -0.237  0.453   -0.445  0.703   -0.219

a The sum of the squared errors (E) < 0.011.
b Scale factor = 0.4176.
c Scale factor = 0.390.


Table 15
Weight matrices by the reconstruction learning^a

         1       2       3       4       5       6       7       8       9       10      11      12      13      14

(A) Weight matrix between the first- and second-layer neurons^b
1(C1)    0.000   0.000   0.000   0.000   0.000   -0.002  0.000   0.002   0.000   0.000   0.000   0.000   0.000   0.000
2(C2)    0.000   0.000   0.000   0.000   0.000   0.594   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
3(C3)    0.000   0.000   0.000   0.000   0.000   0.004   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
4(C4)    0.000   0.000   0.000   0.000   0.000   -0.515  0.000   0.006   0.000   0.000   0.000   0.000   0.000   0.000
5(C5)    0.000   0.000   0.000   0.000   0.000   -0.626  0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.000
6(C6)    0.000   0.000   0.000   0.000   0.000   0.994   0.000   -0.122  0.000   0.000   0.000   0.000   0.000   0.000
7(C7)    0.000   0.000   0.000   0.000   0.000   -0.998  0.000   0.744   0.000   0.000   0.000   0.000   0.000   0.000

(B) Weight matrix between the third- and second-layer neurons^c
1(exo)   -0.001  -0.001  -0.001  -0.001  -0.001  0.998   0.001   -0.577  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001
2(endo)  0.001   0.001   0.001   0.001   0.001   -0.998  0.001   0.577   0.001   0.001   0.001   0.001   0.001   0.001

a The sum of the squared errors (E) < 0.153.
b Scale factor = 0.323.
c Scale factor = 0.168.

6.3.3. Practical application: the relationship between the 13C NMR chemical shift and the conformation of norbornane and norbornene

The reconstruction learning method was applied to the Kowalski problem. In accordance with the former studies, we used 25 (nos. 1-25) out of the 38 data as training data and the others were used for prediction. The network structure was set to be N(7,14,2). To avoid complexity, we do not adopt a bias. The reconstruction learning, which consists of fifty training cycles and one erasing procedure, was repeated fifty times. Table 14 shows the values of the connections obtained by the usual backpropagation and Table 15, those obtained by the reconstruction learning, where the maximum value of the connection was scaled to be 0.999 to compare them at the same level of magnitude (the scale factor is given in each table) (Table 16).

We have shown in a previous paper that the predictions by the neural network were better than those by the linear learning machine and cluster analysis. The results by the present method are the same. Without reconstruction, the information obtained through the learning phase is widely distributed among the neurons. With reconstruction, however, the connections are localized between special neurons. It is also understood that the neurons other than 6 and 8 in the second layer have nothing to do with the resulting classification other than to optimize the θ values. Firing of neuron 1 in the last layer corresponds to the exo conformation and that of the other neuron, to the endo conformation. Therefore, the conformation is determined by the neurons 6 and 8 propagating the information to the two neurons in the third layer with the types of connections (a, -a) and (-b, b). The neurons in the third layer make the decision by combining them.

The 6 and 8 neurons in the second layer have the strongest connections with the neurons 6 and 7 in the first layer. Since the order of the neurons in the first layer is made to correspond to the numbering of the carbon atoms, it is understood that the information on the chemical shifts at C6 and C7 plays the major role in deciding the endo/exo conformations of the derivatives of norbornene. This is consistent with the chemical idea that the C6 and C7 carbon atoms are located near the substituent and that the effect of the substituent on C6 may be reversed on C7. Note here that neuron 8 in the second layer connects only with neurons 6 and 7 in the first layer, suggesting that the information on the endo/exo conformation entailed on neuron 6 of the second layer is corrected by neuron 8.

6.4. Descriptor mapping [42]

Descriptors in QSAR/QSPR analysis are not always linear to the output intensity. In addition to this, descriptors are often mutually dependent. The neural network makes their analysis possible. Andrea and Kalayeh showed mutual dependencies among descriptors in the QSAR of dihydrofolate reductase


inhibitors [43]. In the third layer they used one neuron with a sigmoid function (this is not, therefore, the MR-type network which is described here). Although they did not show the rationale of interpolation in their network, the results are suggestive. We, therefore, discuss the rationale of the interpolation and extend their method to three-dimensional analyses using the MR-type network.

Table 16
Comparison of the results by the backpropagation learning and reconstruction learning methods

           Backpropagation        Reconstruction
           Exo      Endo          Exo      Endo
1          0.913    0.084         0.899    0.101
2          0.989    0.009         0.973    0.027
3          0.994    0.005         0.978    0.022
4          0.844    0.155         0.855    0.145
5          0.955    0.049         0.939    0.061
6          0.961    0.039         0.955    0.045
7          0.948    0.051         0.931    0.069
8          0.964    0.034         0.947    0.053
9          0.926    0.073         0.914    0.086
10         0.968    0.030         0.935    0.065
11         0.970    0.028         0.959    0.041
12         0.974    0.026         0.962    0.038
13         0.841    0.161         0.759    0.241
14         0.014    0.985         0.050    0.950
15         0.015    0.900         0.051    0.949
16         0.088    0.979         0.136    0.864
17         0.020    0.979         0.063    0.937
18         0.017    0.984         0.049    0.951
19         0.025    0.975         0.072    0.928
20         0.014    0.986         0.046    0.954
21         0.220    0.788         0.226    0.774
22         0.015    0.984         0.050    0.950
23         0.067    0.926         0.115    0.885
24         0.021    0.979         0.059    0.941
25         0.019    0.981         0.042    0.958
26         0.703    0.293         0.707    0.293
27         0.079    0.924         0.115    0.885
28         0.906    0.091         0.904    0.096
29         0.025    0.974         0.070    0.930
30         0.981    0.018         0.966    0.034
31^a       0.708    0.293         0.629    0.371
32         0.953    0.045         0.941    0.059
33         0.009    0.990         0.041    0.959
34         0.974    0.029         0.958    0.042
35         0.063    0.941         0.100    0.900
36         0.008    0.992         0.038    0.962
37         0.825    0.177         0.738    0.262
38         0.103    0.898         0.093    0.907

a Error. Data 1-25 were used for training and those of 26-38 were used for prediction.

6.4.1. Method
To make it easy to understand, let us consider the case with two variables, i.e. the intensity (I) is a function of the variables r and s. Using the minimum values of r and s (r_0 and s_0), which are given by the training data, the differences in the intensities, ΔI(r) and ΔI(s), are obtained as

ΔI(r) = I(r, s_0 + Δs) - I(r_0, s_0)
ΔI(s) = I(r_0 + Δr, s) - I(r_0, s_0)        (29)

where Δr and Δs are scanned in the regions given by the training data. By displaying ΔI(r) and ΔI(s), the mutual relationship between the variables can be found. If there are more than two variables, one variable is scanned as the descriptor and the others are treated all together as the background intensity. To show the usefulness and reliability of the descriptor mapping method, we examined the reproducibility of individual mathematical functions from the mixed function. They include simple linear combinations of linear functions and/or a nonlinear function. Since the number of samples in QSAR analysis is generally rather small, we used 30 sets of independent random numbers of x, y (and z). These numbers ranged from 0.0 to 9.9. The threshold in the backpropagation learning was less than 10^-5 in terms of the scaled unit. The a and b values were set at 4.0 and 1.0, respectively.

6.4.2. Examination using mathematical functions

w = x + 2y + 3z
First, using an ideal linear relationship, we examined whether or not the neural network could reproduce the fact that each variable is exactly linear and independent of the other variables. The network structure was N(4,10,1). One neuron in the first layer was used as the bias neuron. Taking variable x, for example, the range of actual input values was sectioned into 11 parts, from 0 to 100%, while the ranges of the other variables were also sectioned in the same way. These are designated as the descriptor intensity for x and the background intensity, respectively.
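The scan of Eq. (29) can be sketched in a few lines. In the sketch below, predict merely stands in for a trained network's intensity I(r, s); in practice it would be the forward pass of the fitted MR-type network. The ranges, the 11-step sectioning and the example function inside predict are assumptions for illustration only.

import numpy as np

def predict(r, s):                                   # stand-in for the trained I(r, s)
    return r + 10.0 * np.exp(-((s - 5.0) ** 2) / 4.0)

r0, r1 = 0.0, 10.0                                   # descriptor range from the data
s0, s1 = 0.0, 10.0                                   # background range from the data
steps = 11                                           # 0%, 10%, ..., 100% sections

r_grid = np.linspace(r0, r1, steps)                  # scanned descriptor values
ds_grid = np.linspace(0.0, s1 - s0, steps)           # scanned background offsets

# Eq. (29): dI(r) = I(r, s0 + ds) - I(r0, s0), evaluated on the whole grid;
# dI(s) is obtained analogously by scanning s against r0 + dr.
baseline = predict(r0, s0)
dI_r = np.array([[predict(r, s0 + ds) - baseline for ds in ds_grid] for r in r_grid])

print(dI_r.shape)                                    # an 11 x 11 surface

Each such surface, with the scanned descriptor on one axis and the background intensity on the other, corresponds to the kind of three-dimensional map plotted in Figs. 14-17.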


Fig. 14. Analysis of correlation between variables in w = x + 2y + 3z.

As the background intensity, the lowest values for both x and y were at 0% while the highest values were at 100%. Fig. 14 shows a three-dimensional mapping for each variable.

It is clear that, as each descriptor increases, the intensity w linearly increases in accord with its coefficient. Such linearity is independent of the values of the other variables, resulting in flat planes with constant gradients. The deviations are small enough to be ignored in practical applications. These are ideal linear relationships of the variables with the intensity (w). If nonlinear correlation is included, the plane would be deformed.

w = x - y + z²
This function includes a nonlinear part (z²). It may be necessary to show whether the neural network normally reproduces such a nonlinear function. The network structure and parameters were the same as those in the former case. The number of sets of random numbers was 30, plus 2 from x = y = z = 0 and x = y = z = 10. The lowest value for each background intensity was set at 0% and the highest at 100%.

Fig. 15 shows the obtained results. The three functions are reasonably separated. However, a small distortion can be observed in the plane for descriptor x. This may be because the z value is large. However, such distortion may be negligible in practical application. We investigated this distortion and found that it arose because the number of training data was small: it was greatly improved by increasing the sample number to 60. Consequently, the neural network extracts the characteristics of each function in a mixed function independently of the other variables.

z = x + 10 exp[-(y - 5)²/4]
This function has a maximum point at y = 5 (50%). The network structure was N(3,10,1).


Fig. 15. Analysis of correlation between variables in w = x - y + z².

Fig. 16. Analysis of correlation between the x and y variables in z = x + 10 exp[-(y - 5)²/4].

The sample number was 32. However, to reproduce the smooth surfaces, the ranges between the maximum and minimum values for x and y were sectioned into 21 parts; therefore, 441 (= 21 × 21) points were predicted. Fig. 16 shows the results. Each function is beautifully reproduced without distortion.


Fig. 17. Descriptor mapping of structural parameters of carboquinones. Here, ‘bg’ indicates the background intensity.


It is rather surprising that only 30 sets of data were good enough to reproduce such a complicated function (Fig. 17).

6.4.3. Application to SAR analysis
As we have seen, the descriptor mapping method seems to be useful for analyzing the characteristics of structural parameters in QSAR/QSPR analysis. Here, we show the results of applying this method to an actual QSAR analysis. We used the data in which the least discrepancy was included.

Table 17
Input data for carboquinones^a

No.  R1, R2                              MR1,2   π1,2    π2      MR1     F       R       Activity^b
1    C6H5, C6H5                          5.08    3.92    1.96    2.54    0.16    -0.16   4.33
2    CH3, (CH2)3C6H5                     4.5     3.66    3.16    0.57    -0.08   -0.26   4.47
3    C5H11, C5H11                        4.86    5       2.5     2.43    -0.08   -0.26   4.63
4    CH(CH3)2, CH(CH3)2                  3       2.6     1.3     1.5     -0.08   -0.26   4.77
5    CH3, CH2C6H5                        3.57    2.51    2.01    0.57    -0.12   -0.14   4.85
6    C3H7, C3H7                          3       3       1.5     1.5     -0.08   -0.26   4.92
7    CH3, CH2OC6H5                       3.79    2.16    1.66    0.57    -0.04   -0.13   5.15
8    R1=R2=CH2CH2OCON(CH3)2              6.14    0.72    0.36    3.07    -0.08   -0.26   5.16
9    C2H5, C2H5                          2.06    2       1       1.03    -0.08   -0.26   5.46
10   CH3, CH2CH2OCH3                     2.28    1.03    0.53    0.57    -0.08   -0.26   5.57
11   OCH3, OCH3                          1.58    -0.04   -0.02   0.79    0.52    -1.02   5.59
12   CH3, CH(CH3)2                       2.07    1.8     1.3     0.57    -0.08   -0.26   5.6
13   C3H7, CH(OCH3)CH2OCONH2             4.24    0.98    -0.52   1.5     -0.04   -0.13   5.63
14   CH3, CH3                            1.14    1       0.5     0.57    -0.08   -0.26   5.66
15   H, CH(CH3)2                         1.6     1.3     1.3     0.1     -0.04   -0.13   5.68
16   CH3, CH(OCH3)C2H5                   2.75    1.53    1.03    0.57    -0.04   -0.13   5.68
17   C3H7, CH2CH2OCONH2                  3.56    1.45    -0.05   1.5     -0.08   -0.26   5.68
18   R1=R2=CH2CH2OCH3                    3.42    1.03    0.53    1.71    -0.08   -0.26   5.69
19   C2H5, CH(OC2H5)CH2OCONH2            4.23    0.98    -0.02   1.03    -0.04   -0.13   5.76
20   CH3, CH2CH2OCOCH3                   2.78    1.23    0.73    0.57    -0.08   -0.26   5.78
21   CH3, (CH2)3-dimer                   1.96    2       1.5     0.57    -0.08   -0.26   5.82
22   CH3, C2H5                           1.6     1.5     1       0.57    -0.08   -0.26   5.86
23   CH3, CH(OCH2CH2OCH3)CH2OCONH2       4.45    0.01    -0.49   0.57    -0.04   -0.13   6.03
24   CH3, CH2CH(CH3)OCONH2               3.09    0.75    0.25    0.57    -0.08   -0.26   6.14
25   C2H5, CH(OCH3)CH2OCONH2             3.77    0.48    -0.52   1.03    -0.04   -0.13   6.16
26   CH3, CH(C2H5)CH2OCONH2              3.55    1.25    0.75    0.57    -0.08   -0.26   6.18
27   CH3, CH(OC2H5)CH2OCONH2             3.77    0.48    -0.02   0.57    -0.04   -0.13   6.18
28   CH3, (CH2)3OCONH2                   3.09    0.95    0.45    0.57    -0.08   -0.26   6.18
29   CH3, (CH2)2OCONH2                   2.63    0.45    -0.05   0.57    -0.08   -0.26   6.21
30   C2H5, (CH2)2OCONH2                  3.09    0.95    -0.05   1.03    -0.08   -0.26   6.25
31   CH3, CH2CH2OH                       1.78    0.34    -0.16   0.57    -0.08   -0.26   6.39
32   CH3, CH(CH3)CH2OCONH2               3.09    0.75    0.25    0.57    -0.08   -0.26   6.41
33   CH3, CH(OCH3)CH2OCONH2              3.31    -0.02   -0.52   0.57    -0.04   -0.13   6.41
34   H, N(CH2)2                          1.66    0.18    0.18    0.1     0.1     -0.92   6.45
35   R1=R2=CH2CH2OH                      2.42    -0.32   -0.16   1.21    -0.08   -0.26   6.54
36   CH3, N(CH2)2                        2.13    0.68    0.18    0.57    0.06    -1.05   6.77
37   CH3, CH(CH3)CH2OH                   2.47    -0.13   -0.63   0.57    -0.04   -0.13   6.9

a The data were taken from the literature [44].
b Chronic injection, log(1/C).


The example here concerns carboquinones (anticarcinogenic agents). Carboquinones were synthesized by Nakao et al. and other groups and were developed into an anticarcinogenic drug for clinical use. A detailed QSAR study based on the Hansch method was carried out by Yoshimoto et al. We have already used those data to compare the results of the neural network with those of conventional QSAR techniques [44]. We used the same structural parameters and the same compound numbers as in the literature. Table 17 shows the input data. The input data, the physicochemical parameters, are the molecular refractivity constants (MR), the hydrophobicity constant (π) and the substituent constants (F and R), as well as MR1,2 and π1,2. As biological data, we used the minimum effective dose (MED) on a chronic treatment schedule only. MED is the dose giving a 40% increase in lifespan compared to the controls.

The input data are scaled to have values between 0.1 and 0.9 and are fed to the network together with the constant 1 for the bias. The network structure was N(7,6,1), while the network parameters a and b were 4.0 and 1.0, respectively. The iterative backpropagation learning was repeated until the sum of the error became less than 0.003 in terms of scaled units. Fig. 17 shows the results. Here, for example, a background intensity of 10% indicates that all of the other physicochemical parameters take 10% of their maximum magnitudes.

At first glance, those descriptors behave nonlinearly and irregularly. Descriptor 1 (= parameter MR1,2) in Fig. 17 has the biggest contribution to the intensity, around 60-70% when the background intensity is around 20-30%, while descriptor 2 (= parameter π1,2) has a negative contribution to the intensity, and its maximum strength appears at 60% intensity of the background. One can thus analyze the characteristics of each descriptor.

This article is not intended to give a concrete analysis of the present data but to give a method by which the structural parameters can be analyzed. Therefore, we will not go into further detail. In practice, the analysis may be carried out on a computer display and three descriptors may be analyzed simultaneously, for example, MR1,2, π1,2, and the background intensity. The neural network method takes some time in the learning phase. However, prediction by the trained network is rapid; even 1000 points can be handled in a second with a moderately high-speed personal computer. This enables one to rotate the three-dimensional graph to find the optimal point in relation to other descriptors such as the background intensity.

7. Concluding remarks

Recently, the number of applications of ANNs in the pharmaceutical sciences has been increasing. Since optimization and prediction problems frequently appear in this field, the hierarchy-type ANN is the main target of application. Most articles which deal with application simply compare the ANN with conventional methods for prediction, fitting, etc.; the operation itself is not dealt with. This article reviews ANN articles from the basic viewpoint of such operating characteristics. ANNs have outstanding abilities in both classification and fitting. The operation is basically carried out in a nonlinear manner. The nonlinearity has merits as well as a small number of demerits. The reasons for the demerits are analyzed and their remedies are shown.

The operation of the neural network can be fully expressed by mathematics. The mathematical relationships of the ANN's operation and the ALS method as well as the multiregression analysis are reviewed. The ANN can be regarded as a function that transforms an input vector into another (output) one. We examined the analytical formula for the partial derivative of this function with respect to the elements of the input vector. This is a powerful means to determine the relationship between the input and output; one can find causes for the results. The reconstruction-learning method determines the minimum number of necessary neurons of the network and is useful to find the necessary descriptors or to trace the flow of information from the input to the output. Finally, the descriptor-mapping method is reviewed. This is a useful method to find the nonlinear relationships between descriptors or between the output intensity and the descriptors.

Acknowledgements

The author thanks the Ministry of Education, Culture, Sports, Science and Technology of Japan for financial support.


References

[1] E.R. Kandel, J.H. Schwarz, Principles of Neural Science, Elsevier, North-Holland, New York, 1982.
[2] D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Vols. 1 and 2, MIT Press, Cambridge, MA, 1986.
[3] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci. USA 79 (1982) 2554-2558.
[4] J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Natl. Acad. Sci. USA 81 (1984) 3088-3092.
[5] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Sci. 9 (1985).
[6] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, 2nd Edition, Wiley-VCH, Weinheim, 1999.
[7] T. Kohonen, Analysis of a simple self-organizing process, Biol. Cybern. 43 (1982) 59-62.
[8] T. Kohonen, Self-Organization and Associative Memory, 3rd Edition, Springer, Berlin, 1989.
[9] S. Anzali, G. Garnickel, M. Krug, J. Sadowski, M. Wagener, J. Gasteiger, Evaluation of molecular surface properties using a Kohonen neural network, in: J. Devillers (Ed.), Neural Networks in QSAR and Drug Design, Academic Press, London, 1996.
[10] J.P. Doucet, A. Panaye, 3D structural information: from property prediction to substructure recognition with neural networks, SAR QSAR Environ. Res. 8 (1998) 249-272.
[11] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural networks applied to pharmaceutical problems. I. Method and application to decision making, Chem. Pharm. Bull. 37 (1989) 2558-2560.
[12] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural networks applied to structure-activity relationships, J. Med. Chem. 33 (1990) 905-908.
[13] J. Devillers (Ed.), Neural Networks in QSAR and Drug Design, Academic Press, London, 1996.
[14] D.A. Winkler, D.J. Madellena, QSAR and neural networks in life sciences, Ser. Math. Biol. Med. 5 (1994) 126-163.
[15] D. Manallack, D.J. Livingstone, Neural networks and expert systems in molecular design, Methods Princ. Med. Chem. 3 (1995) 293-318.
[16] S. Anzali, J. Gasteiger, U. Holzgrabe, J. Polanski, J. Sadowski, A. Techentrup, M. Markus, The use of self-organizing neural networks in drug design, 3D QSAR Drug Design 2 (1998) 273-299.
[17] D.J. Maddalena, Applications of soft computing in drug design, Exp. Opin. Ther. Pat. 8 (1998) 249-258.
[18] T. Savid, D.J. Livingstone, Neural networks in drug discovery: have they lived up to their promise?, Eur. J. Med. Chem. 34 (1999) 195-208.
[19] M.E. Brier, G.R. Aronoff, Application of artificial neural networks to clinical pharmacology, Int. J. Clin. Pharmacol. Ther. 34 (1996) 510-514.
[20] J. Rui, S. Ling, Neural networks model and its application in clinical pharmacology, Zhongguo Linchuang Yaolixue Zazhi 13 (1997) 170-176, in Chinese.
[21] E. Tafeit, G. Reibnegger, Artificial neural networks in laboratory medicine and medical outcome prediction, Clin. Chem. Lab. Med. 37 (1999) 845-853.
[22] S. Agatonovic-Kustrin, R. Beresford, Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research, J. Pharm. Biomed. Anal. 22 (2000) 717-727.
[23] S. Nagl, Neural network models of protein domain evolution, Hyle 6 (2000) 143-159.
[24] C. Ochoa, A. Chana, Applications of neural networks in the medicinal chemistry field, Curr. Med. Chem.: Central Nervous System Agents 1 (2001) 247-256.
[25] Y. Cai, J. Gong, Z. Cheng, N. Chen, Artificial neural network method for quality estimation of traditional Chinese medicine, Zhongcaoyao 25 (1994) 187-189, in Chinese.
[26] Y.J. Qiao, X. Wang, K.S. Bi, X. Luo, Application of artificial neural networks to the feature extraction in chemical pattern recognition of the traditional Chinese medicine Venenum bufonis, Yaoxue Xuebao 30 (1995) 698-701, in Chinese.
[27] L. Geng, A. Luo, R. Fu, J. Li, Identification of Chinese herbal medicine using artificial neural network in pyrolysis-gas chromatography, Fenxi Huaxue 28 (2000) 549-553, in Chinese.
[28] J. Bourquin, H. Schmidli, P. van Hoogevest, H. Leuenberger, Basic concepts of artificial neural networks (ANN) modeling in the application to pharmaceutical development, Pharm. Dev. Technol. 2 (1997) 95-109.
[29] K. Takayama, M. Fujikawa, T. Nagai, Artificial neural network as a novel method to optimize pharmaceutical formulations, Pharm. Res. 16 (1999) 1-6.
[30] R.C. Rowe, R.J. Roberts, Artificial intelligence in pharmaceutical product formulation: neural computing and emerging technologies, Pharm. Sci. Technol. Today 1 (1998) 200-205.
[31] T. Takagi, Pharmacometrics. New region in pharmaceutical science, Farumashia 37 (2001) 695-699.
[32] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115-133.
[33] D.O. Hebb, The Organization of Behavior, Wiley, New York, 1949.
[34] T. Aoyama, H. Ichikawa, Reconstruction of weight matrices in neural networks. A method of correlating output with inputs, Chem. Pharm. Bull. 39 (1991) 1222-1228.
[35] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural networks as applied to quantitative structure-activity relationship analysis, J. Med. Chem. 33 (1990) 2583-2590.
[36] M. Minsky, S. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA, 1969.
[37] T. Aoyama, H. Ichikawa, Basic operating characteristics of neural networks when applied to structure-activity studies, Chem. Pharm. Bull. 39 (1991) 358-366.
[38] I. Moriguchi, K. Komatsu, Adaptive least-squares classification applied to structure-activity correlation of antitumor mitomycin derivatives, Chem. Pharm. Bull. 25 (1977) 2800-2802.
[39] T. Aoyama, H. Ichikawa, Obtaining the correlation indices between drug activity and structural parameters using a neural network, Chem. Pharm. Bull. 39 (1991) 372-378.
[40] T. Aoyama, H. Ichikawa, Neural networks as nonlinear structure-activity relationship analyzers. Useful functions of the partial derivative method in multilayer neural networks, J. Chem. Inf. Comput. Sci. 32 (1992) 492-500.
[41] B.R. Kowalski, Chemometrics: Theory and Applications, ACS Symposium Series, Vol. 53, American Chemical Society, Washington, DC, 1977, p. 43.
[42] H. Ichikawa, A. Aoyama, How to see characteristics of structural parameters in QSAR analysis: descriptor mapping using neural networks, SAR QSAR Environ. Res. 1 (1993) 115-130.
[43] T.A. Andrea, H. Kalayeh, Applications of neural networks in quantitative structure-activity relationships of dihydrofolate reductase inhibitors, J. Med. Chem. 34 (1991) 2824-2836.
[44] M. Yoshimoto, H. Miyazawa, H. Nakao, K. Shinkai, M. Arakawa, Quantitative structure-activity relationships in 2,5-bis(1-aziridinyl)-p-benzoquinone derivatives against leukemia L-1210, J. Med. Chem. 22 (1979) 491-496.